Counterexamples in Model Checking – A Survey

Hichem Debbi
Department of Computer Science, University of Mohamed Boudiaf, M'sila, Algeria
E-mail: hichem.debbi@gmail.com

Overview paper

Keywords: model checking, counterexamples, debugging

Received: December 9, 2016

Model checking is a formal method used for the verification of finite-state systems. Given a system model and a specification, which is a set of formal properties, the model checker verifies whether or not the model meets the specification. One of the major advantages of model checking over other formal methods is its ability to generate a counterexample when the model falsifies the specification. Although the main purpose of the counterexample is to help the designer find the source of the error in complex systems design, the counterexample has also been used for many other purposes, either in the context of model checking itself or in other domains in which model checking is used. In this paper, we survey algorithms for counterexample generation, from classical algorithms in graph theory to novel algorithms for producing small and indicative counterexamples. We also show how counterexamples are useful for debugging, and how we can benefit from delivering counterexamples for other purposes.

Povzetek (Slovenian abstract, translated): This overview paper deals with counterexamples in the formal verification of finite-state systems, i.e., systems of lesser computational power than Turing machines. Counterexamples benefit designers in several ways, above all as a means of checking that a design behaves correctly.

1 Introduction

Model checking is a formal method used for the verification of finite-state systems. Given a system model and a specification, which is a set of formal properties in temporal logics like LTL [109] and CTL [28, 52], the model checker verifies whether or not the model meets the specification. One of the major advantages of model checking over other formal methods is its ability to generate a counterexample when the model falsifies the specification. The counterexample is an error trace; by analysing it, the user can locate the source of the error. The original algorithm for counterexample generation was proposed by [31] and was implemented in most symbolic model checkers. This algorithm, which generates linear counterexamples for ACTL, a fragment of CTL, was later extended to handle arbitrary ACTL properties using the notion of tree-like counterexamples [36]. Since then, many works have addressed this issue in model checking. Counterexample generation has its origins in graph theory through the problem of fair cycle and Strongly Connected Component (SCC) detection, because model checking algorithms for temporal logics employ cycle detection and, technically, a finite system model determines a transition graph [32].
The original algorithm for fair cycle detection in LTL and CTL model checking was proposed by [53]. Since then, many variants of this algorithm and new alternatives have been proposed for LTL and CTL model checking. In Section 3 we briefly investigate the problem of fair cycle and SCC detection.

While the early works introduced by [28, 52] investigated the problem of generating counterexamples so widely that they led to practical implementations within well-known model checkers, the open problem that emerged was the quality of the generated counterexample and how well it really serves its purpose. Therefore, in the last decade many papers have considered this issue, earlier in terms of structure [36], by proposing the notion of tree-like counterexamples to handle ACTL properties, and later in works investigating the quality of the counterexample, mostly in terms of length, so that it is useful for debugging. In Section 3, we investigate the methods proposed for generating minimal, small and indicative counterexamples in conventional model checking. Model checking algorithms fall into two main categories, explicit and symbolic. While explicit algorithms are applied directly on the transition system, symbolic algorithms employ specific data structures. Generally, explicit algorithms are adopted for LTL model checking, whereas symbolic algorithms are adopted for CTL model checking. In that section, the algorithms for generating small counterexamples are presented with respect to each type of algorithm.

However, generating small and indicative counterexamples alone is not enough for understanding the error. Therefore, counterexample analysis is inevitable. Many works in model checking have addressed the analysis of counterexamples to better understand the error. In Section 4, we investigate the approaches that aim to help the designer localize the source of the error given counterexamples. We consider that most of these methods fall into two main categories: those that are applied on the counterexample itself without any need for other information, and those that require successful runs or witnesses to be compared with the counterexamples.

Probabilistic model checking has appeared as an extension of model checking for analyzing systems that exhibit stochastic behavior. Several case studies in many domains have been addressed, from randomized distributed algorithms and network protocols to biological systems and cloud computing environments. These systems are usually described using Discrete-Time Markov Chains (DTMC), Continuous-Time Markov Chains (CTMC) or Markov Decision Processes (MDP), and verified against properties specified in Probabilistic Computation Tree Logic (PCTL) [78] or Continuous Stochastic Logic (CSL) [9, 10]. In probabilistic model checking (PMC), counterexample generation has a quantitative aspect: the counterexample is a set of paths on which a path formula holds and whose accumulated probability mass violates the probability bound. Due to its specific nature, we devote Section 5 to counterexample generation in probabilistic model checking. As in conventional model checking, addressing error explanation in probabilistic model checking is highly required, especially since a probabilistic counterexample consists of multiple paths instead of a single path, and it is probabilistic. So, in that section we also investigate counterexample analysis in PMC.
The most important thing about counterexamples is that they do not just serve as a debugging tool; they are also used to refine the model checking process itself, through Counterexample-Guided Abstraction Refinement (CEGAR) [37]. CEGAR is an automatic verification method mainly proposed to tackle the state-explosion problem, and it is based on the information obtained from the generated counterexamples. In Section 6, we show how the counterexample contributes to this famous verification method.

Testing is an automated method used to verify the quality of software. When we use model checking to generate test cases, this is called model-based testing. This method has known great success in industry through the use of famous model checkers such as SPIN, NuSMV and Java PathFinder. Model checking is used for testing for two main reasons: first, because model checking is fully automated, and secondly and more importantly because it delivers counterexamples when the property is not satisfied. In Section 7, we show how the counterexample serves as a good tool for generating test cases.

Although counterexample generation is at the heart of model checking, not all model checkers deliver counterexamples to the user. In Section 8, we review the famous tools that generate counterexamples. Section 9 concludes the paper, where some brief open problems and future directions are presented.

2 Preliminaries and definitions

Kripke Structure. A Kripke structure is a tuple M = (AP, S, s0, R, L) that consists of a set AP of atomic propositions, a set S of states, an initial state s0 ∈ S, a total transition relation R ⊆ S × S and a labelling function L : S → 2^AP that labels each state with a set of atomic propositions.

Büchi Automaton. A Büchi automaton is a tuple B = (S, s0, E, Σ, F), where S is a finite set of states, s0 ∈ S is the initial state, E ⊆ S × S is the transition relation, Σ is a finite alphabet, and F ⊆ S is the set of accepting or final states. We use a Büchi automaton to define a set of infinite words over an alphabet. A path is a sequence of states (s0 s1 ... sk), k ≥ 1, such that (si, si+1) ∈ E for all 0 ≤ i < k; a path s0 ... sl ... sk is accepting if sl ... sk forms an accepting cycle. We call a path that starts at the initial state and reaches an accepting cycle an accepting path or counterexample (see Figure 1). A minimal counterexample is an accepting path with a minimal number of transitions.

Strongly Connected Component. A graph is a pair G = (V, E), where V is a set of states and E ⊆ V × V is the set of edges. A path is a sequence of states (v1, ..., vk), k ≥ 1, such that (vi, vi+1) ∈ E for all 1 ≤ i < k. A Strongly Connected Component (SCC) is a maximal set of states C ⊆ V such that for every pair of states v, v′ ∈ C there is a path from v to v′ and a path from v′ to v.

Discrete-Time Markov Chain. A DTMC is a tuple D = (S, s_init, P, L), where S is a finite set of states, s_init ∈ S is the initial state, P : S × S → [0, 1] is a transition probability matrix with ∑_{s′∈S} P(s, s′) = 1 for every s ∈ S, and L : S → 2^AP is a labelling function. An infinite path σ is a sequence of states s0 s1 s2 ... such that P(si, si+1) > 0 for all i ≥ 0. A finite path is a finite prefix of an infinite path. We denote the set of paths starting from a state s0 by Paths(s0). The probability of a finite path is calculated as follows:

P(σ ∈ Paths(s0) | s0 s1 ... sn is a prefix of σ) = ∏_{0 ≤ i < n} P(si, si+1)

PCTL. PCTL state formulas are formed according to the following grammar:

Φ ::= true | a | ¬Φ | Φ1 ∧ Φ2 | P∼p(ψ)

where a ∈ AP is an atomic proposition, ∼ ∈ {<, ≤, ≥, >} is a comparison operator, and p is a probability threshold. The path formulas ψ are formed according to the following grammar:

ψ ::= Φ1 U Φ2 | Φ1 W Φ2 | Φ1 U≤n Φ2 | Φ1 W≤n Φ2

where Φ1 and Φ2 are state formulas and n ∈ ℕ. As in CTL, the temporal operators (U for strong until, W for weak (unless) until, and their bounded variants) are required to be immediately preceded by the operator P. A PCTL formula is a state formula, where path formulas only occur inside the operator P. The operator P can be seen as a quantification operator replacing both ∀ (universal quantification) and ∃ (existential quantification), since the properties represent quantitative requirements.
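The product formula above is easy to mechanize. The following is a minimal sketch (not taken from the paper) that computes the probability of a finite path in a small, hypothetical three-state DTMC represented as a dictionary-based transition matrix.

```python
# A minimal sketch: probability of a finite path in a DTMC, following the
# product formula above. The transition matrix is a hypothetical example.
P = {
    ("s0", "s1"): 0.6, ("s0", "s2"): 0.4,
    ("s1", "s1"): 1.0,
    ("s2", "s0"): 0.5, ("s2", "s2"): 0.5,
}

def path_probability(path):
    """Multiply the one-step probabilities along a finite path s0 s1 ... sn."""
    prob = 1.0
    for s, t in zip(path, path[1:]):
        prob *= P.get((s, t), 0.0)   # 0.0 if the transition does not exist
    return prob

print(path_probability(["s0", "s2", "s0", "s1"]))  # 0.4 * 0.5 * 0.6 = 0.12
```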
The semantics of a PCTL formula over a state s (or a path σ) in a DTMC model D = (S, s_init, P, L) is defined by a satisfaction relation denoted ⊨. The satisfaction of P∼p(ψ) on a DTMC depends on the probability mass of the set of paths satisfying ψ. This set is considered as a countable union of cylinder sets, so that its measurability is ensured.

The semantics of PCTL state formulas for DTMC is defined as follows:

s ⊨ true
s ⊨ a          ⟺ a ∈ L(s)
s ⊨ ¬Φ         ⟺ s ⊭ Φ
s ⊨ Φ1 ∧ Φ2    ⟺ s ⊨ Φ1 and s ⊨ Φ2
s ⊨ P∼p(ψ)     ⟺ P(s ⊨ ψ) ∼ p

Given a path σ = s0 s1 ... in D and an integer j ≥ 0, where σ[j] = sj, the semantics of PCTL path formulas for DTMC is defined, as for CTL, as follows:

σ ⊨ Φ1 U Φ2     ⟺ ∃ j ≥ 0. σ[j] ⊨ Φ2 ∧ (∀ 0 ≤ k < j. σ[k] ⊨ Φ1)
σ ⊨ Φ1 W Φ2     ⟺ σ ⊨ Φ1 U Φ2 ∨ (∀ k ≥ 0. σ[k] ⊨ Φ1)
σ ⊨ Φ1 U≤n Φ2   ⟺ ∃ 0 ≤ j ≤ n. σ[j] ⊨ Φ2 ∧ (∀ 0 ≤ k < j. σ[k] ⊨ Φ1)
σ ⊨ Φ1 W≤n Φ2   ⟺ σ ⊨ Φ1 U≤n Φ2 ∨ (∀ 0 ≤ k ≤ n. σ[k] ⊨ Φ1)

For specifying properties of CTMCs, we use Continuous Stochastic Logic (CSL). CSL has the same syntax and semantics as PCTL, except that in CSL the time bound in the bounded until formula can be given as an interval of non-negative reals. Before verifying CSL properties over a CTMC, the CTMC has to be transformed into its embedded DTMC. Therefore, further description of CTMC model checking is beyond the scope of this paper; we refer to [9, 10] for further details.

Generally, two types of properties can be expressed using temporal logics: safety and liveness. Safety properties state that something bad never happens; a simple example is the LTL formula G¬error, which means that an error never occurs. Liveness properties state that something good eventually happens; a simple example is the CTL formula AG(req → AF grant), which means that every request is eventually granted.

3 Counterexample generation

Counterexample generation has its origins in graph theory through the problem of cycle detection. Cycle detection is an important issue at the heart of model checking, whether explicit or symbolic. To deal with this issue, various algorithms were proposed for both LTL and CTL model checking. Explicit-state model checking is based on the Büchi automaton, which is a type of ω-automaton. The fairness condition relies on several sets of accepting states, where the acceptance condition is visiting the acceptance set infinitely often. So, a run is accepting if and only if it contains a state of every accepting set infinitely often. As a result, emptiness of the language is checked by establishing the non-existence of a fair cycle or, equivalently, of a fair non-trivial strongly connected component (SCC) that intersects each accepting set. In the case of non-emptiness, the accepting run is a sign of property failure, and as a result it is rendered as an error trace. We call this error trace a counterexample. So, the counterexample is typically presented as a finite stem followed by a finite cycle. Several algorithms were proposed to find counterexamples in reasonable time, whereas finding the shortest counterexample has been proved to be an NP-complete problem [31, 82]. To find fair SCCs, Depth-First Search (DFS) and Breadth-First Search (BFS) algorithms are used. The main algorithm employing DFS is Tarjan's algorithm [126], which is based on manipulating the states of the graph explicitly. This algorithm is used to generate linear counterexamples in LTL verification and showed promising results [43, 129].
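For reference, the following is a compact, standard rendering of Tarjan's SCC algorithm in Python. It is a generic textbook version on a hypothetical adjacency-list graph, not the implementation used by any particular model checker; the recursive form is fine for small graphs.

```python
# Standard Tarjan SCC algorithm (sketch) on a hypothetical adjacency-list graph.
def tarjan_sccs(graph):
    index = {}          # discovery index of each state
    lowlink = {}        # smallest index reachable from the state
    on_stack = set()
    stack, sccs = [], []
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop(); on_stack.discard(w); scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

graph = {"s0": ["s1"], "s1": ["s2"], "s2": ["s0", "s3"], "s3": []}
print(tarjan_sccs(graph))   # [['s3'], ['s2', 's1', 's0']]
```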
Tarjan's algorithm is also adopted in probabilistic model checking to generate probabilistic counterexamples for lower-bounded properties, through finding bottom strongly connected components (BSCCs) [5]. A BSCC is an SCC B from which no state outside B is reachable. Finding the set of BSCCs of a probabilistic model is an important issue for the verification of PCTL and CSL properties. Tarjan's algorithm runs in linear time, but as the number of states grows it simply becomes infeasible. As a result, symbolic algorithms have been proposed as a solution. In contrast to explicit algorithms, symbolic algorithms [17, 19] employ BFS and can describe large sets in a compact manner using characteristic functions. Several symbolic algorithms were proposed for computing the set of states that contains all the fair SCCs, without enumerating them [32, 84, 128]; we refer to these as SCC-hull algorithms. Currently, most symbolic model checkers employ Emerson's algorithm due to its high performance, and it was proven by [58] that the algorithms of [52] and [31] can work in a complementary way. Other works [83, 136] proposed algorithms based on enumerating the SCCs; we refer to these as symbolic SCC-enumeration algorithms.

Different approaches for generating counterexamples have been proposed for the two types presented above. Clarke et al. [31] proposed a hull-based approach based on Emerson's algorithm, searching for a cycle in a fair SCC close to the initial state. Another approach, by Hojati [84], was also employed by other works for generating counterexamples that use isolation techniques for the SCCs [95]. Using Emerson's algorithm in combination with an SCC-enumeration algorithm is possible, but it is still not guaranteed to yield a counterexample of short length. Ravi et al. [111] introduced a careful analysis of each type of these algorithms. Since there is no guarantee of finding terminal SCCs close to the initial state, finding short counterexamples remained a trade-off and an open problem, and it was thus investigated later by many researchers in both explicit and symbolic model checking.

3.1 Short counterexamples in explicit-state model checking

A counterexample in the Büchi automaton is a path σ = βγ, where β is a loop-free path from the initial state to an accepting state and γ is a loop around this accepting state. A minimal counterexample is thus simply a counterexample with a minimal number of transitions. More formally, a counterexample σ = βγ is minimal if (|β| + |γ|) ≤ (|β′| + |γ′|) for any path σ′ = β′γ′. With respect to this definition, a counterexample has at least one transition. Many algorithms consider the issue of generating counterexamples given a Büchi automaton [130, 85, 112]. All these works employ Nested Depth-First Search (NDFS), but they are not capable of finding a minimal counterexample. A basic NDFS algorithm proposed by [130] is depicted in Figure 2.

1: procedure DFS(s)
2:   Mark(⟨s, 0⟩)
3:   for each successor t of s do
4:     if ⟨t, 0⟩ not marked then
5:       DFS(t)
6:     end if
7:   end for
8:   if accepting(s) then seed := s; NDFS(s)
9:   end if
10: end procedure
11: procedure NDFS(s)
12:   Mark(⟨s, 1⟩)
13:   for each successor t of s do
14:     if ⟨t, 1⟩ not marked then
15:       NDFS(t)
16:     else
17:       if (t == seed) then report cycle
18:       end if
19:     end if
20:   end for
21: end procedure

Figure 2: Nested Depth-First Search algorithm [130].
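To make the nested search in Figure 2 concrete, here is a small runnable sketch of the same idea: an outer depth-first search visits the automaton and, from each accepting state, an inner search looks for a cycle back to it. It is a simplification of the algorithm of [130] (for instance, it restarts the inner marking for every seed), and the automaton below is hypothetical.

```python
# Runnable sketch of nested DFS: returns (stem, cycle) if an accepting cycle
# is reachable from the initial state, otherwise None.
def find_accepting_cycle(succ, init, accepting):
    mark0, mark1 = set(), set()

    def ndfs(s, seed):
        mark1.add(s)
        for t in succ.get(s, ()):
            if t == seed:
                return [t]                       # closed a cycle through seed
            if t not in mark1:
                cycle = ndfs(t, seed)
                if cycle is not None:
                    return [t] + cycle
        return None

    def dfs(s, stem):
        mark0.add(s)
        for t in succ.get(s, ()):
            if t not in mark0:
                result = dfs(t, stem + [t])
                if result is not None:
                    return result
        if s in accepting:
            mark1.clear()                        # fresh inner search per seed
            cycle = ndfs(s, s)
            if cycle is not None:
                return stem, cycle               # counterexample: stem + cycle
        return None

    return dfs(init, [init])

succ = {"s0": ["s1"], "s1": ["s2"], "s2": ["s1", "s3"], "s3": []}
print(find_accepting_cycle(succ, "s0", accepting={"s1"}))
# (['s0', 's1'], ['s2', 's1'])  -- stem to s1, then the cycle s1 -> s2 -> s1
```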
The algorithm is based on computing the accepting states by performing a simple search; once an accepting state is found, another search is performed to find an accepting cycle through it.

Although minimal counterexamples can be computed in polynomial time using minimal-path algorithms, the main drawback, in fact, is memory: the resulting Büchi automaton to be checked for emptiness is usually very large, which makes storing all the minimal paths to be compared very difficult. Recently, new methods were proposed to compute minimal counterexamples in Büchi automata [77, 64, 63]. Hansen and Kervinen [77] proposed a DFS algorithm that runs in O(n²), and they showed that O(n log n) is sufficient, although DFS algorithms are memory-consuming in general. This is due to the optimizations added using interleaving. Since the algorithm is based on exploring transitions backwards, adopting this method in practice is very difficult, especially when considering some restrictions. While this method requires more memory than the model checker SPIN does, [64, 63] proposed methods that do not use more memory than SPIN does. The first one uses DFS and its time complexity is exponential [64], whereas Gastin and Moro proposed a BFS algorithm with some optimizations that is able to compute the minimal counterexample in polynomial time [63]. Hansen et al. [76] also proposed a method for computing minimal counterexamples based on Dijkstra's algorithm for detecting strongly connected components. A novel approach was proposed by [93] for generating short counterexamples based on analyzing the entire model and identifying which events contribute most to the error; these events are called crucial. In addition to generating short counterexamples, the technique helps with reducing the state space. The main drawback of this method is how to determine whether such a set of events is really crucial and really led to the error.

3.2 Short counterexamples in symbolic model checking

The original algorithm for counterexample generation in symbolic model checking was proposed by [31] and was implemented in most symbolic model checkers. This algorithm, which generates linear counterexamples for the linear fragment of ACTL, was later extended to handle arbitrary ACTL properties using the notion of tree-like counterexamples [36]. The authors realized that linear counterexamples are very weak for ACTL, and thus they proposed to generate a tree-like Kripke structure instead, which is proven to be a viable counterexample [36, 38]. Formally, a tree-like counterexample is a directed graph whose SCCs are either cycles or simple nodes and which becomes a tree when these SCCs are collapsed. Figure 3 shows an example of a tree-like counterexample for the ACTL property AG¬a ∧ AF¬b. As we see in the figure, the counterexample consists of two paths refuting the two subformulas: the first path leads to a state that satisfies a, whereas the second path, which is expected to be an infinite one, is a path along which b always holds. The generic algorithm for generating tree-like counterexamples, as proposed in [36], is depicted in Figure 4.

Figure 3: A tree-like counterexample for AG¬a ∧ AF¬b.

The counterexample is constructed from an indexed Kripke structure K^i that is obtained by creating an isomorphic copy for each state in the original Kripke structure K, so that no repeated state can be found. This process is called path unraveling. The algorithm traverses the specification formula in a depth-first manner, where each subformula is evaluated recursively.
The symbol O refers to a temporal operator, and C is a global variable used in the unraveling to denote the index of states. The algorithm outputs a sequence of descriptors, path descriptors and loop descriptors, where a pair {desc1, desc2} describes a finite path leading to a cycle. The tree-like counterexample is then obtained as the union of the set of descriptors generated by the CEX algorithm.

1: procedure CEX(K, s0^i, φ)
2:   case φ of
3:     φ1 ∨ φ2:
4:       CEX(K, s0^i, φ1)
5:       CEX(K, s0^i, φ2)
6:     φ1 ∧ φ2:
7:       Select j such that K, s0^i ⊭ φj
8:       CEX(K, s0^i, φj)
9:     AO(φ1, ..., φk):
10:      Determine π = s0^i, ..., sN, ..., sN+M such that K, π ⊭ O(φ1, ..., φk)
11:      desc1 := ⟨s0^i, unravel(C, s1, ..., sN)⟩
12:      desc2 := ⟨unravel(C + N, sN, ..., sN+M)⟩
13:      return desc1 and desc2
14:      C := C + N + M + 1
15:      for all states p ∈ {desc1, desc2} do
16:        for j ∈ {1, ..., k} do
17:          if K, p ⊭ φj then
18:            CEX(K, p, φj)
19:          end if
20:        end for
21:      end for
22:   end case
23: end procedure

Figure 4: The generic counterexample algorithm for ACTL [36].

After these works of Clarke et al., many works addressed the issue of computing short counterexamples in symbolic model checking [117, 29, 108]. Schuppan et al. [117] proposed some criteria that should be met for a Büchi automaton to accept shortest counterexamples. They proved that these criteria are satisfied by the approach of [29] only for future-time LTL specifications, and thus they proposed an approach that meets the criteria for LTL specifications with past. The proposed algorithm employs a breadth-first reachability check with a Binary Decision Diagram (BDD)-based symbolic model checker. The authors in [108] proposed a black-box based technique that masks some parts of the system in order to give an understandable counterexample to the designer. So the work does not just tend to produce minimal counterexamples; it also delivers small indicative counterexamples of good quality, to be analyzed in order to get to the source of the error. The major drawback of this method is that generalizing counterexample generation from symbolic model checking to black-box model checking could lead to non-uniform counterexamples that do not match the intended behavior of the system. While all of these works apply to unbounded model checking [117, 108], the works [122, 120, 113] consider bounded model checking, by lifting assignments produced by a SAT solver, where the quality of the generated counterexample depends on the SAT solver in use. Other works have investigated the use of heuristic algorithms for generating counterexamples [124, 50]. Although heuristics were not widely used, they gave pretty good results and were also used later for generating probabilistic counterexamples.

4 Counterexamples analysis and debugging

One of the major advantages of model checking over other formal methods is its ability to generate a counterexample when the model falsifies the specification. The counterexample represents an error trace; by analyzing it the user can locate the source of the error, and as Clarke wrote: "The counterexamples are invaluable in debugging complex systems. Some people use model checking just for this feature" [27].

However, generating small and indicative counterexamples alone is not enough for understanding the error. Therefore, counterexample explanation is inevitable.
Error explanation is the task of discovering why the system exhibits this error trace. Many works have addressed the automatic analysis of counterexamples to better understand the failure. Error explanation falls into two main categories. The first is based on the error trace itself, by considering the small number of changes that have to be made in order to ensure that the given counterexample is no longer exhibited; these changes then represent the sources of the error. The second is based on comparing successful executions with the erroneous one in order to find the differences, and those differences are then considered candidate causes for the error. Kumar et al. [97] introduced a careful analysis of the complexity of each type. For the first type, they showed using three models (Mealy machines, extended finite state machines, and pushdown automata) that the problem is NP-complete. For the second type, they provided a polynomial algorithm using Mealy machines and pushdown automata, but solving the problem was difficult with extended finite state machines.

Error explanation methods have been successfully integrated into model checkers such as SLAM [12] and JavaPathFinder (JPF) [16]. SLAM takes less execution time than JPF, and can achieve completeness in finding the causes, but according to Groce [67], this could also be harmful. The error explanation process has many drawbacks. The main one is that the counterexample usually consists of a huge number of states and transitions and involves many variables. The second is that the model checker usually floods the designer with multiple counterexamples, without any kind of classification; this makes the task of choosing a helpful counterexample for debugging purposes challenging. Besides, a single counterexample might not be enough to understand the behavior of the system. Analyzing a set of counterexamples together is an option, but it requires much effort, and even then, the counterexamples to be analyzed could contain the same diagnostic information, which may make analyzing the set a waste of time. The last and most important problem in error explanation is that not all the events that occur on the error trace are of importance to the designer, so locating critical events is the goal behind error explanation. In this section, we survey some works with respect to the two categories.

4.1 Computing the minimal number of changes

Jin et al. [92] proposed an algorithm for analyzing counterexamples based on local information, by segmenting the events of the counterexample into two kinds of segments, fated and free. The fated segments refer to the events that obviously have to occur in the executions, and the free segments refer to the events that should be avoided for the error not to occur, and thus they are candidate causes. Fated and free segments are computed with respect to the input variables of the system, which are classified into controlling and non-controlling. While controlling variables are considered critical and have more control over the environment, the non-controlling variables have less importance. So, fated segments are determined with respect to controlling variables, whereas free segments are determined with respect to non-controlling ones.

Wang et al. [132] also proposed a method that works only on the failed execution path, without considering successful ones. The idea is to look at the predicates that are candidates for causing the failure in the error trace.
To do so, they use weakest precondition computation, a technique widely used in predicate abstraction. This computation aims to find the minimal number of conditions that should be met in order not to let the program violate the assertion. This results in a set of predicates that contradict each other. By comparing how these predicates contradict each other, we can find the cause of the failed assertion and map it back to the real code. Many similar works also provided error explanation methods in the context of C programs [137, 127, 138].

Using the notion of causality by Halpern and Pearl [74], Beer et al. [88] introduced a polynomial-time algorithm for explaining LTL counterexamples that was implemented as a feature in the IBM formal verification platform RuleBase PE. Given the error trace, the causes for the violation are highlighted visually as red dots on the error trace itself. The question asked was: which values of signals on the trace cause it to falsify the specification? Following the definition of Halpern and Pearl, they refer to a set of state-variable pairs as bottom-valued pairs whose values should be switched to make such a state-variable pair critical. A pair is said to be critical if changing the value of the variable in this state no longer produces a counterexample. This pair represents the cause of the first failure of the LTL formula on the error trace, where they argue that the first failure is the most relevant to the user. Nevertheless, the algorithm computes an over-approximation of the set of causes, not just the first cause that occurred.

Let φ be an LTL formula in Negation Normal Form (NNF) and π = s0, s1, ..., sk a counterexample for it. The algorithm for computing the approximate set of causes given φ and π is depicted in Figure 5. The procedure invokes each time a function val for evaluating subformulas of φ on the path. The procedure is executed recursively on the formula φ until it reaches the proposition level, where the cause is finally returned as a pair ⟨si, p⟩ or ⟨si, ¬p⟩, where si refers to the current state.

1: procedure CAUSES(φ, π^i)
2:   case φ of
3:     p:
4:       if p ∉ si then
5:         return ⟨si, p⟩
6:       end if
7:     ¬p:
8:       if p ∈ si then
9:         return ⟨si, p⟩
10:      end if
11:    Xφ1:
12:      if i …

Figure 5: The algorithm for computing the approximate set of causes [88].

Let us consider the formula G((¬START ∧ ¬STATUS_VALID ∧ END) → [¬START U STATUS_VALID]). The result of executing the RuleBase PE implementation of the algorithm on this formula is shown in Figure 6. The red dots refer to the relevant causes of the error. Where some variables are not critical for the failure, others can be critical, which means that switching their values alone could prevent the violation. For instance, in state 9, START precedes STATUS_VALID; by switching the value of START from 1 to 0 in state 9, the formula would no longer fail on this counterexample.

4.2 Comparing counterexamples with successful runs

This is the most widely adopted method for error explanation, and it is successfully featured in model checkers such as SLAM and JavaPathFinder. Groce et al. [70] proposed an approach for counterexample explanation based on computing a set of faulty runs called negatives, in which the counterexample is included, and comparing it to a set of correct runs called positives. Analyzing the common features and differences can lead to useful diagnostic information. Their algorithms were implemented in Java PathFinder.
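The following is a simplified sketch of the trace-comparison idea, not the actual algorithm of Groce et al. [70]: report, step by step, the variables whose values differ between a negative (failing) run and a positive (passing) run. The traces and variable names are hypothetical.

```python
# Compare a failing run with a passing run and list the differing valuations.
negative = [{"req": 1, "grant": 0}, {"req": 1, "grant": 0}, {"req": 0, "grant": 0}]
positive = [{"req": 1, "grant": 0}, {"req": 1, "grant": 1}, {"req": 0, "grant": 0}]

def trace_differences(neg, pos):
    diffs = []
    for step, (ns, ps) in enumerate(zip(neg, pos)):
        for var in sorted(set(ns) | set(ps)):
            if ns.get(var) != ps.get(var):
                diffs.append((step, var, ns.get(var), ps.get(var)))
    return diffs

for step, var, nval, pval in trace_differences(negative, positive):
    print(f"step {step}: {var} = {nval} in the failing run, {pval} in the passing run")
# step 1: grant = 0 in the failing run, 1 in the passing run
```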
Based on Lewis's counterfactual theory of causality [105] and distance metrics, Groce [68] proposed a semi-automated approach for isolating errors in ANSI C programs, by considering the alternative worlds as program executions and the events as propositions about those executions. The approach relies on finding causal dependencies between predicates of a program. A predicate a is causally dependent on b given the faulty execution if and only if the executions in which the removal of cause a also removes the effect b are more likely than the executions in which a and b do not appear together. For finding these traces, which are as close as possible to the faulty one, the authors employed a distance metric.

The strongest path is s0 s1, which is included in the most indicative probabilistic counterexample.

5.1 Probabilistic counterexample generation

Various approaches for probabilistic counterexample generation have been proposed. Aljazzar et al. [1, 3] introduced an approach for counterexample generation for DTMCs and CTMCs against timed reachability properties using heuristics and directed explicit state-space search. Since resolving nondeterminism in an MDP results in a DTMC, in complementary work [4] Aljazzar and Leue proposed an approach for counterexample generation for MDPs based on the existing methods for DTMCs. Aljazzar and Leue then presented a complete framework in [5] for generating counterexamples for DTMCs and CTMCs as what they refer to as diagnostic subgraphs. All these works on generating indicative counterexamples led to the development of the K* algorithm [6], an on-the-fly, heuristics-guided algorithm for the k-shortest-paths problem. Compared to classical k-shortest-paths algorithms, K* has two main advantages: it works on the fly, in a way that avoids exploring the entire graph, and it can be guided using heuristic functions. Based on all the previous works, they built the tool DiPro [2] for generating indicative counterexamples for DTMCs, CTMCs and MDPs. This tool can be used jointly with the model checkers PRISM [81] and MRMC [94], and can render the counterexamples in text format as well as in graphical mode. These heuristic-based algorithms showed great efficiency in terms of counterexample quality. Nevertheless, with large models, the DiPro tool that implements these algorithms usually takes a long time to produce a counterexample. By running DiPro on a PRISM model of the DTMC presented in Figure 12 against the same property, we obtain the most indicative counterexample CX3. The graphical representation of CX3 as rendered by DiPro is depicted in Figure 13. The diamonds refer to the final or end states (s1, s3, s5), whereas the circles represent the simple nodes s2 and s4. The user can navigate through the counterexample and inspect all values.

Figure 13: A counterexample generated by DiPro.

Similar to the previous works, [75] proposed the notion of the smallest most indicative counterexample, which reduces to the problem of finding k shortest paths. In a weighted digraph transformed from the DTMC model, and given the initial state and the target states, the strongest evidences that form the counterexample are selected using extensions of k-shortest-paths algorithms for an arbitrary number k.
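The sketch below illustrates the "strongest evidences" idea for an upper-bounded reachability property P≤p(F target): enumerate paths to the target in order of decreasing probability and stop once their accumulated mass exceeds the bound p. This is a plain best-first search, not the K* algorithm itself, and the DTMC is hypothetical.

```python
import heapq

def strongest_evidences(P, init, targets, bound, max_iters=10_000):
    # queue entries: (-probability, path); extending a path never increases it,
    # so complete paths are popped in non-increasing probability order
    queue = [(-1.0, [init])]
    evidences, mass = [], 0.0
    for _ in range(max_iters):
        if not queue or mass > bound:
            break
        neg_prob, path = heapq.heappop(queue)
        prob, last = -neg_prob, path[-1]
        if last in targets:
            evidences.append((path, prob))     # next strongest evidence
            mass += prob
            continue
        for (s, t), p in P.items():
            if s == last:
                heapq.heappush(queue, (-(prob * p), path + [t]))
    return evidences, mass

P = {("s0", "s1"): 0.5, ("s0", "s2"): 0.5, ("s2", "s1"): 0.4, ("s2", "s3"): 0.6}
print(strongest_evidences(P, "s0", targets={"s1"}, bound=0.6))
# evidences ['s0','s1'] (prob 0.5) and ['s0','s2','s1'] (prob 0.2), mass ≈ 0.7
```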
Instead of generating path-based counterexamples, [134] proposed a novel approach for DTMCs and MDPs based on critical subsystems, using SMT solvers and mixed integer linear programming. A critical subsystem is simply a part of the model (states and transitions) that is considered relevant because of its contribution to exceeding the probability bound. The problem has been shown to be NP-complete. Another work, also based on the notion of critical subsystems, was proposed to deliver abstract counterexamples with a smaller number of states and transitions using a hierarchical refinement method. Based on all of these works, Jansen et al. proposed the COMICS tool for generating the critical subsystems that induce the counterexamples [90].

There are also many other works that address special cases of counterexample generation in PMC. The authors of [8] proposed an approach for finding sets of evidences for bounded probabilistic LTL properties on MDPs that behave differently from each other, giving significant diagnostic information. While their method is also based on k shortest paths, the main contribution is about selecting the evidences or witnesses with respect to five main criteria in addition to high probability. While all of the previous works for counterexample generation are explicit-based, the authors of [133] proposed a symbolic method using bounded model checking. In contrast to the previous methods, this method lacks the selection of the strongest evidences first, since the selection is performed in arbitrary order. Another approach for counterexample generation that uses bounded model checking has been proposed in [15]. Unlike the previous work, which uses conventional SAT solvers, the authors used an SMT-solving approach in order to put constraints on the selected paths, so as to get a more abstract counterexample that consists of the strongest paths. Counterexample generation for probabilistic LTL model checking has been addressed in [116], and probabilistic CEGAR has also been addressed [80]. A comprehensive representation of counterexamples using regular expressions has been addressed in [44]. Since regular expressions deliver compact representations, they can help to deliver short counterexamples. Besides, they are widely known and easily understandable, so they can give further benefit as a tool for error explanation.

5.2 Probabilistic counterexample analysis

Instead of relying on the state space resulting from the parallel composition of the modules, [135] suggests relying directly on the guarded command language used by the model checker, which is more natural and helpful for debugging purposes. To do so, the authors employ the critical subsystem technique [134] to identify the smallest set of guarded commands contributing to the error.

To analyze probabilistic counterexamples, Debbi and Bourahla [48, 47] proposed a diagnostic method based on the definition of causality by Halpern and Pearl [74] and on responsibility [25]. The proposed method takes the probabilistic counterexample generated by the DiPro tool and the probabilistic formula as input, and returns a set of pairs (state, variable) as candidate causes for the violation, ordered with respect to their contribution to the error. So, in contrast to the previous methods, this method does not tend to generate indicative counterexamples, but acts directly on indicative counterexamples that have already been generated.
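As a simplified illustration of contribution-based ranking (not the algorithm of [48, 47]), one can rank the states that appear in a probabilistic counterexample by the probability mass of the evidences they occur on. The counterexample below is a hypothetical set of (path, probability) pairs.

```python
from collections import defaultdict

counterexample = [(["s0", "s1"], 0.5), (["s0", "s2", "s1"], 0.2)]

def rank_states(evidences):
    contribution = defaultdict(float)
    for path, prob in evidences:
        for state in dict.fromkeys(path):   # count each state once per path
            contribution[state] += prob
    return sorted(contribution.items(), key=lambda kv: -kv[1])

print(rank_states(counterexample))
# s0 and s1 lie on evidence mass ≈ 0.7, s2 only on ≈ 0.2
```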
Another similar approach for debugging probabilistic counterexamples was introduced by [46]. It adopts the same definition of causality by Halpern and Pearl to reason formally about the causes, and then transforms the causality model into a regression model using Structural Equation Modeling (SEM). SEM is a comprehensive analytical method used for testing and estimating causal relationships between variables embedded in a theoretical causal model. This method helps to understand the behavior of the model by quantifying the causal effect of the variables on the violation, and the causal dependencies between them.

The same definition of causality has also been adapted to event orders for generating fault trees from probabilistic counterexamples, where the selection of the traces forming the fault tree is restricted by a minimality condition [102]. To do so, Leitner-Fischer and Leue proposed the event order logic to reason about Boolean conditions on the occurrence of events, where the cause of the hazard in their context is presented as an Event Order Logic (EOL) formula, which is a conjunction of events. In [57], they extended their approach by integrating causality into an explicit-state model checking algorithm to give a causal interpretation for sub- and super-sets of execution traces, which could help the designer get a better insight into the behavior of the system. They proved the applicability of their approach on many industrial-size PROMELA models. They extended the causality checking approach to probabilistic counterexamples by computing the probabilities of event combinations [101], but they still consider the use of causality checking on qualitative PROMELA models.

6 Counterexample-guided abstraction refinement (CEGAR)

The main challenge in model checking is the state explosion problem. Dealing with this issue is at the heart of model checking; it was addressed at the very beginning of model checking and is not finished yet. Many methods have been proposed to tackle this issue; the most famous are symbolic algorithms, Partial Order Reduction (POR), Bounded Model Checking (BMC) and abstraction. Among these techniques, abstraction is considered the most general and flexible for handling the state explosion problem [30]. Abstraction is about hiding or simplifying some details of the system to be verified, even removing some parts of it that are considered irrelevant for the property under consideration. The central idea is that verifying a simplified or abstract model is more efficient than verifying the entire model. Evidently, this abstraction has a price, which is losing some information, and the best abstraction methods are those that control this loss of information. Over-approximation and under-approximation are two key concepts for this problem. Many abstraction methods have been proposed [42, 65, 106]; the last one received the most attention and was adopted in the symbolic model checker NuSMV.

An abstraction can be defined by a set of abstract states Ŝ, an abstraction mapping function h that maps the states of the concrete model to Ŝ, and the set of atomic propositions AP labelling these states. Regarding the choice of Ŝ, h and AP, we distinguish three main types of abstraction: predicate abstraction [66, 115], localization reduction [99] and data abstraction [39]. Predicate abstraction is based on eliminating some variables from the program and replacing them with predicates that still carry the information about these variables.
Each predicate has a corresponding Boolean variable, and the resulting abstract states Ŝ are valuations of these variables. Both the abstraction mapping h between the concrete and abstract states and the set of atomic propositions AP are determined with respect to the truth values of these predicates. The entire abstract model can then be defined through existential abstraction. To this end, we can use BDDs, SAT solvers or theorem provers, depending on the size of the program. Localization reduction and data abstraction are actually just extensions of predicate abstraction. Localization reduction aims to define a small set of variables that are considered relevant to the property to be verified; these variables are called visible, and the remaining variables that have no importance with respect to the property are called invisible. We should mention that this technique does not apply any abstraction on the domain of the visible variables. Data abstraction deals mainly with the domains of variables, by making an abstract domain for each variable, so the abstract model is built with respect to the abstract values. For more detail on abstraction techniques, we refer to [71].

Given the possible loss of information caused by abstraction, devising refinement methods for the abstract model is necessary. The best-known method for abstraction refinement is Counterexample-Guided Abstraction Refinement (CEGAR), which was proposed by [30] as a generalization of the localization reduction approach; a prototype implementation of this method in NuSMV has also been presented. In this approach, the counterexample plays the crucial role in finding the right abstract model. The CEGAR process consists of three main steps. The first is to generate an abstract model using one of the abstraction techniques [30, 23, 33], given a formula φ. The second step is to check the satisfaction of φ: if it is satisfied, the model checker stops and reports that the concrete (original) model satisfies the formula; if it is not satisfied, a counterexample is generated. The generated counterexample lives in the abstract model, so we have to check whether it is also a valid counterexample in the concrete model, because the abstract model behaves differently from the concrete one. Otherwise, the counterexample is called spurious, and the abstraction must be refined based on this counterexample. So, a spurious counterexample is an erroneous counterexample that exists only in the abstract model, not in the concrete model. The final step is to refine the model until no spurious counterexample is found (see Figure 14). This is how the technique gets its name: refining the abstract model using the spurious counterexample. Refinement is an important task in CEGAR that can make the process faster and give the appropriate results. To refine the abstract model, different partitioning algorithms are applied to the abstract states. As with abstraction, partitioning the abstract states in order to eliminate the spurious counterexample can be carried out in many ways other than BDDs [30]: SAT solvers [24] or linear programming and machine learning [34] can be used to define the most relevant variables to be considered for the next abstraction.

In the literature, we find many extensions of CEGAR depending on the type of predicates and application domains: large program executions [96], non-disjunctive abstractions [107] and propositional circumscription [89].
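The three steps above can be summarized in a schematic loop. In the sketch below, the four helper functions (abstract, model_check, is_realizable, refine) are placeholders standing in for whatever abstraction, model checking and refinement procedures are used; they are assumptions for the illustration, not a real API.

```python
# Schematic CEGAR loop (sketch): abstract, check, validate the counterexample,
# refine on spurious counterexamples, and repeat.
def cegar(concrete_model, phi, abstract, model_check, is_realizable, refine):
    abstraction = abstract(concrete_model)                  # step 1: initial abstraction
    while True:
        holds, counterexample = model_check(abstraction, phi)   # step 2: check phi
        if holds:
            return True, None                               # phi holds on the concrete model
        if is_realizable(concrete_model, counterexample):
            return False, counterexample                    # genuine counterexample
        # spurious counterexample: use it to split abstract states (step 3)
        abstraction = refine(abstraction, counterexample)
```

The early return when the property holds is sound only because the abstraction over-approximates the concrete behavior, which is the setting CEGAR assumes for universal properties.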
The CEGAR technique itself has been used to find bugs in complex and large systems [14]. The idea is based on gathering and saving information during the abstract model checking process in order to generate short counterexamples in case of failure. This can be helpful for large models that make generating counterexamples using standard BMC intractable. CEGAR is currently implemented in many tools such as NuSMV [26], SLAM and BLAST [13].

7 Counterexamples for test case generation

Counterexample generation gives model checking the opportunity to be adopted and used in different domains; one of the domains in which model checking has been adopted is test case generation. Roughly speaking, testing is an automated method used to verify the quality of software. When we use model checking to generate test cases, this is called model-based testing. The use of model checking for testing is mainly subject to the size of the software to be tested, because a suitable model must be guaranteed. The central idea of using model checking for testing [20, 55] is to interpret counterexamples generated by the model checkers as test cases, and then extract test data and some expected results from these tests using an execution framework. Counterexamples are mainly used to help the designer find the source of the error; however, they are also very useful as test cases [60]. A test describes the intended behavior of the test case: the final state, the states that should be traversed to reach the final state, and so forth. In practice, it might not be possible to execute all test cases, since the software to be tested usually has a large number of behaviors. Nevertheless, there exist techniques to help us measure the reliability of testing. These techniques fall into two main categories: first, deterministic methods (given an initial state and some input, we know the output with certainty), whose most famous representatives are coverage analysis and mutation analysis; second, statistical analysis, where the reliability of the test is measured with respect to some probability distribution.

In coverage-based testing, the test purpose is specified in temporal logic and then converted into what is called a never-claim by negation, to assert that the test purpose never becomes true. So, the counterexample generated after the verification process describes how the never-claim is violated, which is a description of how the test purpose is fulfilled. Many approaches for creating never-claims based on coverage criteria (called "trap properties") [61] have been proposed. Coverage criteria aim to find how a system is exercised given a specification, in order to get the states that were not traversed during the test; in this context, we call this specification a test suite. So, full coverage is achieved if all the states of the system are covered. To create a test suite that covers all states, we need a trap property for each possible state: for example, claiming that the value of a variable is never 0, G¬(a = 0). A counterexample to such a trap property is any trace that reaches a state where (a = 0). With regard to trap properties, we find many variations.
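As an illustration of trap properties for state coverage, the sketch below builds one never-claim per state and treats each counterexample returned by a model-checking call as a test case. The property syntax and the `model_check` function are assumptions made for the example, not the API of any real tool.

```python
# Generate one trap property per state and turn counterexamples into test cases.
def state_trap_properties(states, state_var="s"):
    # one never-claim per state, e.g. "G !(s = idle)"
    return {q: f"G !({state_var} = {q})" for q in states}

def tests_from_traps(model, states, model_check):
    tests = []
    for state, trap in state_trap_properties(states).items():
        holds, counterexample = model_check(model, trap)
        if not holds:                       # trap violated: the state is reachable
            tests.append((state, counterexample))   # the trace is the test case
    return tests

print(state_trap_properties(["idle", "busy", "error"]))
# {'idle': 'G !(s = idle)', 'busy': 'G !(s = busy)', 'error': 'G !(s = error)'}
```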
Gargantini and Heitmeyer addressed the coverage of software cost reduction (SCR) specifications [61]. SCR specifications are defined by tables over the events that represent the change of a value in a state and lead to a new state, and over conditions defined on the states. Formally, an SCR model is defined as a quadruple (S, S0, Em, T), where S is the set of states, S0 is the initial state set, Em is the set of input events, and T is the transform function that maps an input event and the current state to a new state. SCR requirement properties can be used as never-claims, first by converting SCR into a model checker's language such as the SPIN language (PROMELA) or the SMV language, and then transforming the SCR tables into if-else constructs in the case of SPIN, or a case statement in the case of SMV. Another approach, by Heimdahl et al., addressed the coverage of transition systems globally [79], where they consider the use of RSML-e as the specification language. A simple example of a transition coverage criterion is of the form G(A ∧ C → ¬B), where A represents a system state s, B represents the next state, and C is the condition that guards the transition from A to B. So a counterexample for this property could be a trace that reaches state B when C evaluates to true, or a trace that reaches a state other than B when C evaluates to false. Hong and Lee [87] proposed an approach based on control and data flow, where they use the SMV model checker to generate counterexamples during the model checking of state-charts. The generated counterexample can be mapped into a test sequence that carries information about which initial and stable states are considered. Another approach, based on abstract state machines, has also been introduced [62]; the trap properties here are defined over a set of rules for guarded function updates. We can see that all coverage-based approaches deal with the same thing, which is trap properties, and differ from each other in the formalism adopted.

Another approach, using requirement properties as test cases, has been introduced by [54]. In this approach, each requirement has a set of tests. Trap properties can be easily derived from requirement properties under property-coverage criteria [125]. Another method that is completely different from coverage-based analysis is mutation-based analysis [18]. Mutation analysis consists of creating a set of mutants, which are obtained by making small modifications to the original program in such a way that the mutants correspond to realistic faults. Mutants are distinguished by their score; a mutant with a high score indicates high fault sensitivity. It is evident that deriving a mutant that is equivalent to the original program has a high computational cost [91], because we have to apply all the test cases to each mutant, all mutants should be considered, and for each mutant the model checker must be invoked. Fraser et al. [59] reviewed most of these techniques in detail, proposed several effective techniques to improve the quality of the test cases generated in model-checking-based testing, especially requirements-based testing, and applied them to different types of properties in many industrial case studies.

8 Counterexample generation tools

Practically, all successful model checkers are able to output counterexamples in varying formats [38]. In this section, we survey the tools supporting counterexample generation and study their effectiveness. A set of model checkers with their features is presented in Table 1.

The Berkeley Lazy Abstraction Software Verification Tool (BLAST) [13] is a software model checking tool for C programs.
BLAST has the ability to generate counterexamples and, furthermore, it employs CEGAR. BLAST is not just a CEGAR-based model checker; it can also be used for generating test cases. BLAST shows promising results on safety properties of medium-sized programs.

CBMC [35] is a well-known bounded model checker for ANSI C and C++ programs. CBMC performs symbolic execution of the programs and employs a SAT solver in the verification procedure; when the specification is falsified, a counterexample in the form of states, with the variable valuations leading to these states, is rendered to the user.

Java PathFinder (JPF) [131] is a famous software model checking tool for Java programs. Java PathFinder is an effective virtual-machine-based tool that verifies the program along all possible executions. Because it runs at the byte-code level, it can deal with most Java language features, and it can generate a detailed report on the error in case the property is violated. Besides, the tool gives the ability to generate test cases.

SPIN [86] is a model checker mostly known for the verification of systems that exhibit a high degree of interaction between processes. The systems are described using the Process Meta Language (PROMELA) and verified against properties specified in LTL. By applying a depth-first search algorithm on the intersection product of the model and the Büchi automaton representing the LTL formula, a counterexample is generated in case an accepting cycle is detected. SPIN offers an interactive simulator that helps to understand the cause of the failure by showing the processes and their interactions in order.

Table 1: Model checkers with their features.

Name   | Programs, systems                     | Algorithms, methods               | Modeling language | Specification language | Counterexample visualization | Counterexample form
BLAST  | C programs                            | Predicate lazy abstraction, CEGAR | C                 | BLAST                  | No                           | Set of predicates
CBMC   | C programs                            | BMC, SAT solving                  | C/C++             | Assertions             | No                           | Variables and valuations
JPF    | Java programs                         | Explicit state, POR               | Java              | No                     | No                           | Error report
SPIN   | Concurrent, distributed, asynchronous | Nested depth-first search, POR    | PROMELA           | LTL                    | Yes                          | Execution path
NuSMV  | Synchronous, asynchronous             | BDD-based, SAT-based BMC          | SMV               | LTL, CTL               | No                           | States and valuations
UPPAAL | Real-time                             | On-the-fly, symbolic              | Timed automata    | TCTL                   | Yes                          | Sequence of states
PRISM  | Probabilistic, real-time              | Graph-based, numerical            | PRISM             | PCTL, CSL, PTCTL       | No (via DiPro)               | Graph
MRMC   | Probabilistic                         | Numerical                         | PRISM, PEPA       | PCTL, CSL, PRCTL, CSRL | No (via DiPro or COMICS)     | Graph

NuSMV [26] is a symbolic model checker that appeared as an extension of the Binary Decision Diagram (BDD)-based model checker SMV. NuSMV includes both LTL and CTL for specification analysis, and combines SAT and BDD techniques for the verification. NuSMV can deliver a counterexample in XML format by indicating the states of the trace and the variables whose new values cause the transitions.

UPPAAL [100] is a verification framework for real-time systems. The systems are modeled as networks of timed automata extended with data types and synchronization channels, and the properties are specified using Timed CTL (TCTL). UPPAAL can find and generate counterexamples in graphical mode as message sequence charts that indicate the events with respect to their order.

PRISM [81] is a probabilistic model checker used for the analysis of systems that exhibit stochastic behavior. The systems are described as DTMCs, CTMCs or MDPs using a guarded command language, and verified against probabilistic properties expressed in PCTL and CSL, which can be extended with rewards.
Another successful probabilistic model checker extended with rewards is the Markov Reward Model Checker (MRMC) [94]. MRMC is mainly used for performance and dependability analysis. It takes the models as input files in two formats, the PRISM language or the Performance Evaluation Process Algebra (PEPA). Although both model checkers have shown high effectiveness, they lack a mechanism for generating probabilistic counterexamples. Nevertheless, they have been used by recent tools (DiPro [2] and COMICS [90]) for generating and visualizing probabilistic counterexamples.

9 Conclusions and future directions

In this paper we surveyed counterexamples in model checking from different aspects. In the early days of model checking, counterexamples were not treated as a subject in their own right, but rather as a problem related to fair-cycle detection algorithms, as presented in Section 3. More recently, the quality of the generated counterexamples has been treated as a standalone and fundamental problem, and many works have tried to deliver short and indicative counterexamples to be used for debugging. Concerning their structure, tree-like counterexamples have been proposed for the ACTL fragment as an alternative to linear counterexamples; however, this approach has not been adopted in model checkers, which still generate simple non-branching counterexamples.

For debugging, we can conclude that approaches which require additional successful runs may have some advantages over methods based on a single trace, in that they compare many good traces to restrict the set of candidate causes. However, these methods usually need much execution time to select the appropriate set of traces, and such traces may contain the same diagnostic information. Regardless of the debugging method in use, the challenge of visualizing the error traces, and the quality of the diagnoses generated to facilitate debugging, remain open issues.

For counterexample generation in PMC, we have seen that the principle is completely different from conventional model checking, and that the presentation of the counterexample varies from one work to another, from a smallest indicative set of paths to most critical subsystems. Despite the notable advancement in generating probabilistic counterexamples, which led to important tools such as DiPro and COMICS, this advancement is still insufficient for debugging. It is therefore more than important to see the techniques for counterexample generation and counterexample analysis integrated into probabilistic model checkers, so that users can actually benefit from them. All these techniques act on the verification results of probabilistic model checkers like PRISM, so performing counterexample generation and analysis during the model checking process itself is still an open problem; solving it could have a great impact on probabilistic model checking.

We have also seen the usefulness of counterexamples for purposes other than debugging, such as CEGAR and test-case generation. For CEGAR, we have seen different approaches for both abstraction and refinement, and that SAT solvers and theorem provers can be exploited on both sides, which makes them very useful for CEGAR. Fast elimination of spurious counterexamples is still an active research topic, and we also expect to see more works on CEGAR in PMC.
For testing, we have seen that the most useful approaches using model checking are those based on coverage and trap properties. Other approaches, such as requirement-based analysis and mutation-based analysis, have received less attention due to the limitations presented. Currently, coverage-based techniques are widely used in industry. In the future, we expect to see new approaches that enable the testing of newly emerging systems, which will require new transformation mechanisms so that trap properties can be verified by model checkers to generate the counterexamples. Such techniques can also benefit from one another: for instance, new efficient CEGAR techniques will have an impact not only on conventional model checking, but on probabilistic model checking as well. We may also see probabilistic model checkers like PRISM used for testing probabilistic systems; since PRISM does not generate counterexamples, any advancement in generating indicative counterexamples could benefit the testing of probabilistic systems. Likewise, techniques based on counterexamples, such as CEGAR, can directly benefit from any advancement in generating small and indicative counterexamples in a reasonable time.

In addition to all of this, we expect to see more works in other domains that adopt model checking techniques just for the sake of obtaining counterexamples. In previous works we have seen, for instance, that counterexamples can be mapped to UML sequence diagrams describing states and events in the original model [51], that they can be used to generate attack graphs in network security [123] and for the fragmentation of services in Service-Based Applications (SBAs) [21], and that they have also been used to enforce synchronizability and realizability in distributed services integration [72].

References

[1] Aljazzar, H., Hermanns, H., and Leue, S. Counterexamples for timed probabilistic reachability. In FORMATS (2005), LNCS, vol. 3829, Springer, Berlin, Heidelberg, pp. 177–195.

[2] Aljazzar, H., Leitner-Fischer, F., Leue, S., and Simeonov, D. DiPro – a tool for probabilistic counterexample generation. In 18th International SPIN Workshop (2011), LNCS, vol. 6823, Springer, Berlin, Heidelberg, pp. 183–187.

[3] Aljazzar, H., and Leue, S. Extended directed search for probabilistic timed reachability. In FORMATS (2006), LNCS, vol. 4202, Springer, Berlin, Heidelberg, pp. 33–51.

[4] Aljazzar, H., and Leue, S. Generation of counterexamples for model checking of Markov decision processes. In International Conference on Quantitative Evaluation of Systems (QEST) (2009), pp. 197–206.

[5] Aljazzar, H., and Leue, S. Directed explicit state-space search in the generation of counterexamples for stochastic model checking. IEEE Trans. on Software Engineering 36, 1 (2010), 37–60.

[6] Aljazzar, H., and Leue, S. K*: A heuristic search algorithm for finding the k shortest paths. Artificial Intelligence 175, 18 (2011), 2129–2154.

[7] Aloul, F., Ramani, A., Markov, I., and Sakallah, K. PBS: A backtrack search pseudo-Boolean solver. In Symposium on the Theory and Applications of Satisfiability Testing (SAT) (2002), pp. 346–353.

[8] Andres, M. E., D'Argenio, P., and van Rossum, P. Significant diagnostic counterexamples in probabilistic model checking. In Haifa Verification Conference (2008), pp. 129–148.

[9] Aziz, A., Sanwal, K., Singhal, V., and Brayton, R. Model-checking continuous-time Markov chains. ACM Transactions on Computational Logic 1, 1 (2000), 162–170.
[10] Baier, C., Haverkort, B., Hermanns, H., and Katoen, J.-P. Model checking algorithms for continuous-time markov chains. IEEETransactions on Software Engineering 29,7(2003), 524–541. [11] Ball,T.,Naik,M.,and Rajamani,S. From symptom to cause: Localizing errors in counterexample tra­ces. In ACM Symposium on the Principles of Pro­gramming Languages (2003), pp. 97–105. [12] Ball, T., and Rajamani, S. The slam project: De­bugging system software via static analysis. InACM Symposium on the Principles of Programming Lan­guages (2002), pp. 1–3. [13] Beyer, D., Henzinger,T., Jhala, R., and Majumdar, R. The software model checker blast: Applications to software engineering. International Journal on SoftwareTools forTechnologyTransfer (STTT)9,5 (2007), 505–525. [14] Bjesse, P., and Kukula, J. Using counterexample guided abstraction refnement to fnd complexbugs. In Design,Automation andTestin European Confe­rence and Exhibition (2004), pp. 156–161. [15] Braitling, B., andWimmer, R. Counterexample ge­neration for markov chains using smt-based boun­ded model checking. In FormalTechniques for Dis­tributed Systems (2011), LNCS,vol. 6722, Springer, Berlin, Heidelberg, pp. 75–89. [16] Brat,G.,Havelund,K.,Park,S.,andVisser,W.Java pathfndera second generationofajava model chec­ker. InWorkshopon AdvancesinVerifcation(2000). [17] Bryant, R. E. Graph-based algorithms for boolean function manipulation. IEEETrans. Comput 35,8 (1986), 677–691. [18] Budd,T., and Gopal,A. Program testingby speci­fcation mutation. Journal Computer Languages 10, 1(1985), 63–73. [19] Burch, J. R., Clarke, E. M., McMillan, K. L., Dill, D. L., and Hwang, L. J. Symbolic model checking: 1020 states and beyond. Information and Computa­tion 98,2(1992), 142–170. [20] Callahan,J., Schneider,F.,and Easterbrook,S. Au­tomated software testing using model checking. In SPINWorkshop (1996). [21] Chabane, Y., Hantry, F., and Hacid, M. Querying and splitting techniques for sba: Amodel checking based approach. In Emerging IntelligentTechnolo­gies in Industry (2011), SCI 369, Springer-Verlag, Berlin, Heidelberg, pp. 105–122. [22] Chaki, S., and Groce, A. Explaining abstract counte­rexamples. In SIGSOFT04/FSE (2004), pp. 73–82. [23] Chauhan,P., Clarke, E.,Kukula, J., Sapra, S.,Veith, H., and D.Wang. Automated abstraction refnement for model checking large state spaces using sat based confict analysis. In Formal Methods in Computer Aided Design(FMCAD) (2002), LNCS, vol. 2517, Springer, Berlin, Heidelberg, pp. 33–51. [24] Chauhan,P., Clarke, E.,Kukula, J., Sapra, S.,Veith, H., and D.Wang. Automated abstraction refne­ment for model checking large state spaces using sat Informatica 42 (2018) 145–166 161 based confict analysis. In FMCAD 2002 (2002), LNCS, vol. 2517, Springer-Verlag, Berlin, Heidel­berg, pp. 33–51. [25] Chockler, H., and Halpern, J.Y. Responsibility and blame: a structural model approach. Journal of Arti­fcial Intelligence Research(JAIR)22,1(2004), 93– 115. [26] Cimatti, A., Clarke, E., Giunchiglia, F., and Ro­veri., M. Nusmv: a new symbolic model verifer. In Proceedings Eleventh Conference on Computer-Aided Verifcation (CAV 99) (1999), LNCS, vol. 1633, Springer, Berlin, Heidelberg, pp. 495–499. [27] Clarke, E. The birth of model checking. In Grum-berg,O.,Veith,H. (eds.)25Yearsof Model Checking (2008), LNCS, Springer, Berlin, Heidelberg, pp. 1– 26. [28] Clarke, E., and Emerson, A. Design and synthe­sis of synchronization skeletons using branching-time temporal logic. In Logic of Programs (1982), Springer-Verlag, pp. 52–71. 
[29] Clarke, E., Grumberg, O., and Hamaguchi, K. Anot­her look at ltl model checking. Formal Methods in System Design 10,1(1997), 47–71. [30] Clarke, E., Grumberg, O., Jha, S., Lu,Y., andVeith, H. Counterexample-guided abstraction refnement for symbolic model checking. Journal of theACM (JACM) 50,5(2003), 752–794. [31] Clarke, E., Grumberg, O., McMillan, K., and Zhao, X. Effcient generation of counterexamples and wit­nesses in symbolic model checking. In Proc. of the DesignAutomation Conference (1995),ACM Press, pp. 427–432. [32] Clarke, E., Grumberg, O., and Peled, D. Model Checking. MIT, 1999. [33] Clarke,E., Gupta,A.,Kukula,J.,and Strichman,O. Sat based abstraction refnement using ilp and ma­chine leraning techniques. In Computer-AidedVe­rifcation (CAV) (2002), LNCS, vol. 2404, Springer, Berlin, Heidelberg, pp. 137–150. [34] Clarke, E., Gupta, A., Kukula, J., and Strichman, O. Sat based abstraction refnement using ilp and machine leraning techniques. In CAV2002 (2002), LNCS, vol. 2404, Springer-Verlag, Berlin, Heidel­berg, pp. 265–279. [35] Clarke, E., Kroening, D., and Lerda, F. A tool for checking ansi-c programs. In TACAS 2004 (2004), LNCS, vol. 2988, Springer, Berlin, Heidel­berg, pp. 168–176. [36] Clarke, E., Lu, Y., s. Jha, and Veith, H. Tree-like counterexamples in model checking. In Proc. of the 17th Annual IEEE Symposium on Logic in Computer Science (2002), pp. 19–29. [37] Clarke, E., O.Grumberg, Jha, S., Lu,Y., andVeith, H. Counterexample-guided abstraction refnement. In CAV (1986), pp. 154–169. [38] Clarke, E., andVeith, H. Counterexamples revisi­ted: Principles, algorithms and applications. In In: Grumberg, O., Veith, H. (eds.) 25 Years of Model Checking (2008), LNCS, Springer, Berlin, Heidel­berg, pp. 1–26. [39] Clarke, E. M., Grumberg, O., and Andlong, D. Mo­del checking and abstraction. ACMTransactions on Programming Languages and Systems 16,5(1994), 1512–1542. [40] Cleve, H., and Zeller,A. Locating causes of program failures. InACM/IEEE International Conference on Software Engineering (ICSE) (2005), pp. 342–351. [41] Copty,F., Irron, A.,Weissberg, O., Kropp, N., and Gila, K. Effcient debugging in a formal verifcation environment. IntJ SoftwToolsTechnolTransfer4 (2003), 335–348. [42] COUSOT,P., and COUSOT, R. Abstract interpre­tation: Aunifed lattice model for static analysis of programs by construction or approximation of fx-points. In ACM Symposium of Programming Lan­guage (2003), pp. 238–252. [43] Couvreur, J. On-the-fy verifcation of linear tempo­ral logic. In FM (1999), LNCS, vol. 1708, Springer, Heidelberg, pp. 253–271. [44] Damman, B., Han, T., and Katoen, J. Regular ex­pressions for pctl counterexamples. In Quantitative Evaluation of Systems(QEST) (2008), pp. 179–188. [45]deAlfaro,L., Henzinger,T.,andMang,F. Detecting errors before reaching them. In CAV (2000), LNCS, vol. 2725, Springer, Berlin, Heidelberg, pp. 186– 201. [46] Debbi, H. Diagnosis of probabilistic models using causality and regression. In in Proceedings of the 8th InternationalWorkshop onVerifcation and Eva­luation of Computer and Communication Systems (2014), pp. 33–44. [47] Debbi, H. Systems Analysis using Model Checking with Causality. PhD thesis, University of M’sila, 2015. [48] Debbi, H., and Bourahla, M. Causal analysis of pro­babilistic counterexamples. In EleventhACM-IEEE International Conference on Formal Methods and Models for Codesign (Memocode) (2008), pp. 77– 86. H. Debbi [49] Dong, G., and Pei, J. Sequence Data Mining. Sprin­ger, 2007. 
[50] Edelkamp, S., Leue, S., and Lluch-Lafuente, A. Directed explicit-state model checking in the vali­dation of communication protocols. International Journal on SoftwareTools forTechnologyTransfer 5,2(2004), 247–267. [51] Elamkulam, J., Z. Glazberg, I. R., Kowlali, G., Chandra, S., Kohli, S., and Dattathrani, S. De­tecting design faws in uml state charts for embedded software. In HVC 2006 (2006), LNCS, vol. 4383, Springer-Verlag, Berlin, Heidelberg, pp. 109–121. [52] Emerson, E., and Halpern, J. Decision procedures and expressiveness in the temporal logic of bran­ching time. In STOC 82: Proceedings of the four­teenth annualACM symposium on Theory of com­puting (1982),ACM Press, pp. 169–180. [53] Emerson, E. A., and Lei, C.-L. Effcient mo­del checking in fragments of the propositional mu-calculus. In ProceedingsoftheFirst Annual Sympo­sium of Logic in Computer Science (1986), pp. 267– 278. [54] Engels, A., Feijs, L., and Mauw, S. Test genera­tion for intelligent networks using model checking. In ThirdInternationalWorkshop onTools and Algo­rithms for the Construction and Analysis of Systems. (TACAS97) (1997), LNCS,vol. 1217, Springer, Ber­lin, Heidelberg, pp. 384–398. [55] Engels, A., Feijs, L., and Mauw, S. Test genera­tion for intelligent networks using model checking. In Third International Workshop on Tools and Al­gorithms for the Construction and Analysis of Sy-stems(TACAS) (2010), LNCS, vol. 1217, Springer, Berlin, Heidelberg, pp. 384–398. [56] Fey, G., and Drechsler, R. Finding good counte­rexamples to aid design verifcation. In FirstACM and IEEE International Conference onFormal Met­hods and Models for Co-Design (MEMOCODE03) (2003), pp. 51–52. [57] Fischer, F., and Leue, S. Causality checking for complex system models. In Verifcation, Mo­del Checking, and Abstract Interpretation (VMCAI) (2013), LNCS, vol. 7737, Springer, Berlin, Heidel­berg, pp. 248–276. [58] Fisler,K., Fraer,R., Kamhi,G.,Vardi,M.,andYang, Z. Is there a best symbolic cycle-detection algo­rithm. In TACAS 2001 (2001), LNCS, vol. 2031, Springer, Berlin, Heidelberg, pp. 420–434. [59] Fraser, G. Automated SoftwareTesting with Model Checkers. PhD thesis, IST -Institute for Software Technology, 2007. [60] Fraser,G.,Wotawa,F.,and Ammann,P.E.Testing with model checkers. Journal of SoftwareTesting, Verifcation and Reliability 19,3(2009), 215–261. [61] Gargantini, A., and Heitmeyer, C. Using model checking to generate tests from requirements spe­cifcations. In ESEC/FSE99: 7th European Soft­ware Engineering Conference, HeldJointly with the 7thACM SIGSOFT Symposium on theFoundations of Software Engineering (1999), LNCS, vol. 1687, Springer, Berlin, Heidelberg, pp. 146–162. [62] Gargantini, A., Riccobene, E., and Rinzivillo, S. Using spin to generate tests from asm specifcati­ons. In Abstract State Machines 2003. Advances in Theory and Practice: 10th InternationalWorks-hop, ASM (2003), LNCS, vol. 2589, Springer, Ber­lin, Heidelberg, pp. 263–277. [63] Gastin, P., and Moro, P. Minimal counterexam­ple generation for spin. In 14th International SPIN Workshop 2007(2007), LNCS, vol. 4595, Springer, Berlin, Heidelberg, pp. 24–38. [64] Gastin, P., Moro, P., and Zeitoun, M. Minimi­zation of counterexample in spin. In SPIN 2004 (2004), LNCS, vol. 2989, Springer, Berlin, Heidel­berg, pp. 92–108. [65] GRAF, S., and ANDSADI, H. Construction of ab­stract state graphs with pvs. In CAV (1997), LNCS, vol. 1254, Springer, Berlin, Heidelberg, pp. 72–83. [66] Graf, S., and Saidi, H. Construction of abstract state graphs with pvs. 
In CAV 97 (1997), LNCS, vol. 1254, Springer-Verlag, Berlin, Heidelberg, pp. 72– 83. [67] Groce, A. Error Explanation andFault Localization with Distance Metrics. PhD thesis, School of Com­puter Science Carnegie Mellon University, 2005. [68] Groce, A., Chaki, S., Kroening, D., and Strichman, O. Error explanation with distance metrics. Interna­tionalJournal on SoftwareTools forTechnology4,3 (2006), 229–247. [69] Groce, A., Kroening, D., and Lerda, F. Under­standing counterexamples with explain. In Alur R., Peled D.A. (eds) Computer AidedVerifcation. CAV 2004 (2004), vol. 3114 of Lecture Notes in Compu­ter Science, Springer, pp. 453–456. [70] Groce,A.,andVisser,W. What went wrong: Explai­ning counterexamples. In SPINWorkshop on Model Checking of Software (2003), pp. 121–135. [71] Grumberg, O. Abstraction and refnement in model checking. In FMCO 2005 (2006), LNCS, vol. 4111, Springer-Verlag, Berlin, Heidelberg, pp. 219–242. Informatica 42 (2018) 145–166 163 [72] Gudemann, M., Salaun, G., and Ouederni, M. Coun­terexample guided synthesis of monitors for re­alizability enforcement. In ATVA 2012 (2012), LNCS, vol. 7561, Springer-Verlag, Berlin, Heidel­berg, pp. 238–253. [73] Guo, L., Roychoudhury, A., andWang,T. Accura­tely choosingexecution runs for softwarefault loca­lization.In 15th international conference on Compi­ler Construction (2006), LNCS,vol. 3923, Springer, Berlin, Heidelberg, pp. 80–95. [74] Halpern, J., and Pearl, J. Causes and explanations: Astructural-model approach part i: Causes. In17th UAI(2001), pp. 194–202. [75] Han, T., and Katoen, J. Counterexamples genera­tion in probabilistic model checking. IEEETrans. on Software Engineering 35,2(2009), 72–86. [76] Hansen, H., and Geldenhuys, J. Cheap and small counterexamples. In Software Engineering andFor­mal Methods, SEFM ’08 (2008), IEEE Computer Society Press, pp. 53–62. [77] Hansen, H., andKervinen, A. Minimal counterex­amplesino(nlogn) memoryando(n2)time. In ACDC 2006(2006), IEEE Computer Society Press, pp. 131–142. [78] Hansson, H., and Jonsson, B. logic for reasoning about time and reliability. Formal aspects of Com­puting6,5(1994), 512–535. [79] Heimdahl, M., Rayadurgam, S., andVisser,W. Spe­cifcation centered testing. In Second International Workshop onAutomates Program Analysis,Testing andVerifcation (2000). [80] Hermanns,H.,Wachter,B.,and Zhang,L. Probabi­listic cegar. In Computer AidedVerifcation (CAV) (2008), LNCS, vol. 5123, Springer, Berlin, Heidel­berg, pp. 162–175. [81] Hinton, A., Kwiatkowska, M., Norman, G., andPar­ker, D. Prism: A tool for automatic verifcation of probabilistic systems. In TACAS(2006), LNCS,vol. 3920, Springer, Berlin, Heidelberg, pp. 441–444. [82] Hojati, R., Brayton, R. K., andKurshan, R.P. Bdd-based debugging of designs using language contain­ment andfair ctl. In Fifth Conference on Computer AidedVerifcation (CAV93) (1993), LNCS,vol. 697, Springer, Berlin, Heidelberg, pp. 41–58. [83] Hojati, R., Brayton, R. K., andKurshan, R.P. Bdd-based debugging of designs using language contai­nment andfair ctl. In CAV 93 (1993), LNCS, vol. 697, Springer, Berlin, Heidelberg, pp. 41–58. [84] Hojati, R., Touati, H., Kurshan, R. P., and Bray-ton, R. K. Effcient -regular language containment. In Computer AidedVerifcation (1992), LNCS, vol. 1708, Springer, Berlin, Heidelberg, pp. 371–382. [85] Holzmann, G., Peled, D., andYannakakis, M. On nested depth frst search. In SPIN’96 (1996). [86] Holzmann, G. J. The model checker spin. IEEE Transactions on Software Engineering 23,5(1997), 1–17. [87] Hong, H. 
S., and Lee, I. Automatic test generation from specifcations for controlfowand data-fowco­verage criteria. In International Conference on Soft­ware Engineering (ICSE) (2003). [88] I.Beer, Ben-David, S., Chockler, H., Orni, A., and Treer, R. Explaining counterexamples using causa­lity. Formal Methods Systems Design 40,1(2012), 20–40. [89] Janota, M., Grigore, R., and Marques-Silva, J. Counterexample guided abstraction refnement al­gorithm for propositional circumscription. In JE­LIA’10 Proceedings of the 12th European confe­rence on Logics in artifcial intelligence (2010), LNCS, vol. 6341, Springer, Berlin, Heidelberg, pp. 195–207. [90] Jansen, N., Abraham, E.,Volk, M.,Wilmer, R., Ka­toen, J., and Becker, B. The comics tool -compu­ting minimal counterexamples for dtmcs. In ATVA (2012), LNCS, vol. 7561, Springer, Berlin, Heidel­berg, pp. 249–253. [91] Jia,Y., and Harman, M. An analysis and surveyof the development of mutation testing. IEEE Tran­sactions ON Software Engineering 37, 05 (2011), 649 – 678. [92] Jin,H.,Ravi,K.,andF.Somenzi.Fateand free will in error traces. International Journal on Software Tools forTechnologyTransfer6,2(2004), 102–116. [93] Kashyap, S., and Garg, V. Producing short coun­terexamples using crucial events. In CAV 2008 (2008), LNCS, vol. 5123, Springer, Berlin, Heidel­berg, pp. 491–503. [94] Katoen, J.-P., Khattri, M., and Zapreev, I. S. A markov reward model checker. In QEST (2005), pp. 243–244. [95]Kesten,Y., Pnueli,A.,ando.Raviv,L. Algorithmic verifcation of linear temporal logic specifcations. In International Colloquium onAutomata, Langua­ges, and Programming (ICALP-98),(1998), LNCS, vol. 1443, Springer, Berlin, Heidelberg, pp. 1–16. H. Debbi [96] Kroening, D., Groce, A., and Clarke, E. Counte­rexample guided abstraction refnement via program execution. In 6th International Conference onFor­mal Engineering Methods (ICFEM) (2004), LNCS, vol. 3308, Springer, Berlin, Heidelberg, pp. 224– 238. [97]Kuma,N.,Kumar,V., andViswanathan,M. On the complexityof errorexplanation.In Verifcation, Mo­del Checking, and Abstract Interpretation (VMCAI) (2005), LNCS, vol. 3385, Springer, Berlin, Heidel­berg, pp. 448–464. [98] Kumazawa, T., and Tamai, T. Counterexample-based error localization of behavior models. In NASAFormal Methods(2011), pp. 222–236. [99] Kurshan, R.P. Computer-AidedVerifcationof coor­dinating processes -the automata theoretic appro­ach. Princeton University Press, 1994. [100] Larsen, K. G., Pettersson,P., andWang,Y. Uppaal in a nutshell. Int.J. SoftwareTools forTechnology Transfer1,1(1997), 134–152. [101] Leitner-Fischer, F., and Leue, S. On the synergy of probabilistic causality computation and causality checking. In SPIN 2013 (2013), LNCS, vol. 7976, Springer-Verlag, Berlin, Heidelberg, pp. 246–263. [102] Leitner-Fischer,F., and Leue, S. Probabilisticfault tree synthesis using causality computation. IJCCBS 4,2(2013), 119–143. [103] Leue,S.,and Befrouei,M.T. Counterexampleex­planation by anomaly detection. In SPIN (2012), vol. 7385 of Lecture Notes in Computer Science, Springer, pp. 24–42. [104] Leue, S., and Befrouei, M. T. Mining sequential patterns to explain concurrent counterexamples. In SPIN (2013), vol. 7976 of Lecture Notes in Compu­ter Science, Springer, pp. 264–281. [105] Lewis, D. Causation. Journal of Philosophy 70 (1973), 556–567. [106] LONG, D. Modelchecking, abstraction and compo­sitional verifcation. PhD thesis, School of Compu­ter Science, Carnegie Mellon University, 2005. [107] McMillan, K., and Zuck, L. 
Abstract counterex­amples for non-disjunctive abstractions. In Reacha­bility Problems (2009), LNCS, vol. 5797, Springer, Berlin, Heidelberg, pp. 176–188. [108] Nopper, T., Scholl, C., and Becker., B. Computa­tionof minimal counterexamplesbyusing blackbox techniques and symbolic methods. In Computer-Aided Design (ICCAD) (2007), IEEE Computer So­ciety Press, pp. 273–280. [109] Pnueli, A. The temporal logic of programs. In 18th Annual Symposium onFoundationsof Compu­ter Science (1977), IEEE, pp. 46–57. [110] Pytlik, B., Renieris, M., Krishnamurthi, S., and Reiss, S.P. Automatedfault localization using po­tential invariants. In AADEBUG’2003, Fifth Inter­national Workshop on Automated and Algorithmic Debugging (2003), pp. 273–276. [111] Ravi, K., Bloem, R., and Somenzi,F.Acomparative study of symbolic algorithms for the computation of fair cycles. InThirdInternational Conference, FM­CAD 2000 (2000), LNCS, vol. 1954, Springer, Ber­lin, Heidelberg, pp. 162–179. [112] Ravi, K., Bloem, R., and Somenzi, F. A note on on-the-fy verifcation algorithms. In TACAS 2005 (2005), LNCS, vol. 3440, Springer, Berlin, Heidel­berg, pp. 174–190. [113] Ravi, K., and Somenzi,F. Minimal assignments for bounded model checking. In TACAS(2004), LNCS, vol. 2988, Springer, Berlin, Heidelberg, pp. 31–45. [114] Renieris, M., and Reiss, S. Fault localization with nearest neighbor queries. In ASE (2003), IEEE Computer Society, pp. 30–39. [115] Saidi, H., and Shankar,N. Abstract and model check while you prove. In CAV 99 (1999), LNCS, vol. 4111, Springer-Verlag, Berlin, Heidelberg, pp. 219– 242. [116] Schmalz, M., Varacca, D., and Volzer, H. Coun­terexamples in probabilistic ltl model checking for markovchains. In International Conference on Con­currency Theory (CONCUR) (2009), LNCS, vol. 5710, Springer, Berlin, Heidelberg, pp. 787–602. [117] Schuppan,V., and Biere, A. Shortest counterexam­ples for symbolic model checking of ltl with past. In 11th International Conference on Tools and Algo­rithms for the Construction and Analysis of Systems (2005), LNCS, vol. 3440, Springer, Berlin, Heidel­berg, pp. 493–509. [118] Shen,S.,Qin,Y.,andLi,S.Bug localizationofhar­dware system with control fow distance minimiza­tion. In 13th IEEE InternationalWorkshoponLogic and Synthesis (IWLS 2004) (2004). [119] Shen, S., Qin, Y., and Li, S. Localizing errors in counterexample with iteratively witness searching. In ATVA(2004), LNCS, vol. 3299, Springer, Berlin, Heidelberg, pp. 456–469. [120] Shen,S.,Qin,Y.,andLi,S. Minimizing counterex­ample with unit core extraction and incremental sat. In Verifcation, Model Checking, and Abstract Inter­pretation (2005), LNCS,vol. 3385, Springer, Berlin, Heidelberg, pp. 298–312. Informatica 42 (2018) 145–166 165 [121] Shen, S., and Y. Qin, S. L. Localizing errors in counterexample with iteratively witness searching. In ATVA 2004 (2004), LNCS, vol. 3299, Springer, Berlin, Heidelberg, pp. 459–464. [122] Shen,S.-Y.,Qin,Y.,andLi,S.Afast counterexam­ple minimization approach with refutation analysis and incremental sat. In Conference on Asia South Pacifc DesignAutomation(2005), pp. 451–454. [123] Sheyner, O., Haines, J., , Jha, S., Lippmann, R., and Wing, J. Automated generation and analysis of at­tack graphs. In IEEE Symposium on Security and Privecy 2002 (2002), pp. 273–284. [124]Tan,J.,Avrunin,G.,and Leue,S. Heuristic-guided counterexample search in favers. In 12th ACM SIGSOFT international symposium onFoundations of software engineering (2004), pp. 201–210. [125] Tan, L., Sokolsky, O., and Lee, I. 
Specification-based testing with linear temporal logic. In Proceedings of the IEEE International Conference on Information Reuse and Integration (2004), pp. 493–498.

[126] Tarjan, R. E. Depth-first search and linear graph algorithms. SIAM Journal of Computing 1, 2 (1972), 146–160.

[127] Tip, F., and Dinesh, T. A slicing-based approach for locating type errors. ACM Transactions on Software Engineering and Methodology 10, 1 (2001), 5–55.

[128] Touati, H. J., Brayton, R. K., and Kurshan, R. P. Testing language containment for ω-automata using BDDs. In International Workshop on Formal Methods in VLSI Design (1991), pp. 371–382.

[129] Valmari, A., and Geldenhuys, J. Tarjan's algorithm makes on-the-fly LTL verification more efficient. In Jensen, K., Podelski, A. (eds.) TACAS (2004), LNCS, vol. 2988, Springer, Berlin, Heidelberg, pp. 205–219.

[130] Vardi, M., Wolper, P., and Yannakakis, M. Memory-efficient algorithms for the verification of temporal properties. Formal Methods in System Design 1, 2 (1992), 275–288.

[131] Visser, W., Havelund, K., Brat, G., Park, S., and Lerda, F. Model checking programs. Automated Software Engineering Journal 10, 2 (2003), 203–222.

[132] Wang, C., Yang, Z., Ivancic, F., and Gupta, A. Whodunit? Causal analysis for counterexamples. In 4th International Symposium, ATVA (2006), LNCS, vol. 4218, Springer, Berlin, Heidelberg, pp. 82–95.

[133] Wimmer, R., Braitling, B., and Becker, B. Counterexample generation for discrete-time Markov chains using bounded model checking. In Verification, Model Checking, and Abstract Interpretation (2009), LNCS, vol. 5403, Springer, Berlin, Heidelberg, pp. 366–380.

[134] Wimmer, R., Jansen, N., Abraham, E., Becker, B., and Katoen, J. Minimal critical subsystems for discrete-time Markov models. In TACAS (2012), LNCS, vol. 7214, Springer, Berlin, Heidelberg, pp. 299–314.

[135] Wimmer, R., Jansen, N., and Vorpahl, A. High-level counterexamples for probabilistic automata. In Quantitative Evaluation of Systems (QEST) (2013), LNCS, vol. 8054, Springer, Berlin, Heidelberg, pp. 39–54.

[136] Xie, A., and Beerel, P. A. Implicit enumeration of strongly connected components. In International Conference on Computer Aided Design (1999), pp. 37–40.

[137] Zeller, A. Yesterday, my program worked. Today, it does not. Why? In ACM Symposium on the Foundations of Software Engineering (1999), pp. 253–267.

[138] Zeller, A. Isolating cause-effect chains for computer programs. In ACM Symposium on the Foundations of Software Engineering (2002), pp. 1–10.

Evaluation of Medical Image Algorithms on Multicore Processors

Damir Demirović
Department of Computer Science and Informatics, Faculty of Electrical Engineering, University of Tuzla, Bosnia and Herzegovina
E-mail: damir.demirovic@untz.ba, http://www.fe.untz.ba

Zekerijah Šabanović
Medical Faculty, University of Tuzla, Bosnia and Herzegovina
E-mail: zekerijah.sabanovic@untz.ba, http://www.medf.untz.ba

Keywords: medical image processing, multicore processor, GPU, GPGPU, filtering, image registration

Received: April 26, 2016

In recent times, medical image processing and analysis have become an essential component of clinical practice. Medical images contain huge amounts of data to process due to increased image resolution. These tasks are inherently parallel in nature, so they naturally fit parallel processors such as the Graphics Processing Unit (GPU). In this work, several commonly used 2-D and 3-D image processing algorithms were evaluated with respect to the computational performance increase obtained with GPUs and CPUs on a personal computer.
For the tested algorithms, the GPU outperforms the CPU by a factor of 1.1 to 422.

Povzetek: V zadnjem času je obdelava in analiza medicinskih slik postala bistvena sestavina v klinični praksi. Medicinske slike vsebujejo ogromne količine podatkov, vendar je procesiranje slik vzporedne narave, posebej primerno za obdelavo z grafično procesno enoto (GPU). V tem delu smo ocenili več pogosto uporabljenih algoritmov za obdelavo slik za 2-D in 3-D glede povečanja zmogljivosti računanja z grafičnimi procesorji na osebnem računalniku. Za testirane algoritme je grafični procesor omogočil zmanjšanje časa računanja od 1,1 do 422-krat.

1 Introduction

In the last decade parallel processing has become dominant in high-performance computing. Increasing the processor clock rate of single-core processors has slowed down due to problems with heat dissipation. Application developers can no longer count on Moore's law to make complex algorithms computationally feasible, and are consequently shifting their algorithms to parallel computing architectures [1][2]. These architectures are multicore Central Processing Units (CPUs), Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs).

The amount of data processed in clinical practice is also increasing: the resolution of medical images keeps growing, and the amount of data to process is exploding. Trends such as the 3-D and 4-D imaging technologies used in treatment planning require a lot of computing power. Due to their nature, these tasks are inherently data-parallel, i.e. the data in such datasets can be processed in parallel using multiple threads.

GPUs, originally designed for the acceleration of computer graphics, have become a versatile platform for running massively parallel computations, thanks to properties such as high memory bandwidth and high computation throughput [2]. Programmable GPUs were introduced in 2004; at first they could run custom parallel programs called shaders, which was the first opportunity to accelerate non-graphical applications with GPUs. Today the GPU has become a viable alternative to the CPU for time-consuming tasks: when the same computation can be performed on many image elements in parallel, it maps easily onto a GPU.

The two dominant parallel computing platforms are NVidia CUDA and OpenCL. OpenCL [3] is a software framework for writing programs that run across heterogeneous platforms such as CPUs, GPUs, digital signal processors (DSPs) and FPGAs, where heterogeneous refers to systems with more than one kind of processor or core. Both CUDA and OpenCL support heterogeneous computing. OpenCL is based on the C programming language and is an open standard. NVIDIA CUDA [4] is a parallel computing platform and Application Programming Interface (API), which also supports the OpenCL programming framework.

In [5] the authors give an introduction to the GPU architecture and its applications in image processing, software development, and numerical applications. In [2] the authors review the principles of GPU computing in the area of medical physics. Segmentation of anatomical structures from image modalities such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) is covered in [6]; due to their computational complexity, most segmentation procedures require vast processing power, such as that of a GPU. A brief literature review of several segmentation methods is given here.
In [7] the authors give a review of GPU applications in medicine, covering past and current trends in this field, such as commonly used methods and algorithms that are specific to individual image modalities. The GPU can also be used effectively in the field of medical visualization. The Marching Cubes algorithm, which extracts surfaces from volumetric data, was presented in [8]; fast extraction is necessary in medical applications, where near real-time operation is very desirable, and their implementation is completely data-parallel, which is ideal for a GPU. In [10] the authors implement the widely known Demons algorithm for medical image registration [16] on a GPU for registering 3-D CT lung images; speedups of 55 times were reported over a non-optimized CPU version. In [20] the authors used OpenCL to evaluate the reconstruction of 3-D volumetric data from C-arm CT projections on a variety of high-performance computing platforms, such as FPGAs, graphics cards and multi-core CPUs. A three-dimensional reconstruction task in cone-beam CT, a computationally complex algorithm, was implemented using CUDA [21]. The book [9] covers the development of data-parallel versions of registration algorithms suitable for execution on a GPU.

Our main objective was to compare algorithms on the CPU and the GPU, and to assess them on different processor architectures. Some of the most frequently used image processing algorithms that are suitable for parallelization were evaluated, and the speedups were compared to a single core of the CPU. The CPU results were used as the baseline for comparison with the GPU results.

2 Methods

In this work, time-consuming algorithms were evaluated on a CPU and a GPU. Algorithms for 2-D and 3-D images were tested, and their running times were evaluated.

Figure 1: OpenCL platform model [3].

There are several software packages for the processing and analysis of medical images. For the purpose of this research, the following software packages were used.

Plastimatch [11] is open-source software for image computation. Its main focus is high-performance volumetric registration of medical images, such as X-ray CT, MRI, and positron emission tomography (PET). Its features include methods for medical image registration, segmentation, etc.

Table 1: Processor specifications.

Capability            | GPU (GTX 560 Ti) | CPU (Intel i5-2500)
OpenCL version        | 1.1              | not available
Compute capability    | 2.1              | not available
Double precision      | Yes              | Yes
Number of cores       | 384              | 4
Max clock freq. (GHz) | 1.7              | 3.7
Global memory (MB)    | 1023             | 6
Power rating (W)      | 170              | 95

OpenMP (Open Multi-Processing) [12] is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on most platforms, processor architectures and operating systems, including Solaris, AIX, HP-UX, Linux, OS X, and Windows.

OpenCL and CUDA offer the same heterogeneous programming model, so a typical sequence of operations is the same in both of them. In both platforms, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Kernels are functions executed on the device (GPU) in parallel. A typical program has the following steps: declare and allocate host and device memory, initialize the host data, transfer data from the host to the device, execute one or more kernels, and transfer the results from the device back to the host. OpenCL is a portable API based on the C99 standard of the C programming language.
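The host–device sequence just described can be sketched in a few lines of PyOpenCL (a Python binding for OpenCL). This is only a generic illustration of the workflow, not code from any of the packages used in this work; the kernel, buffer names and sizes are invented for the example.

```python
import numpy as np
import pyopencl as cl

# 1) Declare and initialize host data.
a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
out = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# 2) Allocate device memory and transfer host -> device.
a_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_dev = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

# 3) Build a kernel and execute it on the device.
program = cl.Program(ctx, """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()
program.add(queue, a.shape, None, a_dev, b_dev, out_dev)

# 4) Transfer the result device -> host.
cl.enqueue_copy(queue, out, out_dev)
print(np.allclose(out, a + b))
```

A CUDA host program follows the same allocate–copy–launch–copy-back pattern.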
The OpenCL platform model (Figure 1) consists of a host and several computing devices, each of which contains several compute units; a compute unit in turn contains several processing elements. The serial code runs in a host (CPU) thread, while the parallel code executes in many device (GPU) threads across multiple processing elements. Functions executed on OpenCL devices are called kernels. Both CUDA and OpenCL provide built-in functions that can take scalar and vector arguments. Native functions are built-in functions with reduced, implementation-defined precision but shorter execution time; the regular built-in functions conform to IEEE 754 compatible rounding for single-precision floating-point calculations.

OpenCV [13] is a library of functions for computer vision. It is cross-platform, released under the BSD license and written in C++, and it supports Intel Integrated Performance Primitives (IPP) optimized routines as well as GPU acceleration through CUDA and OpenCL.

In this work nine commonly used algorithms were evaluated. First, 2-D algorithms that can be used for filtering medical images were evaluated. Medical image datasets usually come as volumes, such as CT images, typically with 100 or more slices, so running times are exceptionally high, which prevents their clinical usage in real time. All experiments presented in this work were run on a PC with an Intel CPU, an NVidia GPU and 8 GB of RAM. For the GPU implementations of the algorithms, NVIDIA CUDA Toolkit version 7.5 was used; the CPU implementations were built with Microsoft Visual Studio Express 2013. The specifications of the processors used in this research are given in Table 1.

For the purpose of the research, we chose nine image processing algorithms that are frequently used in medical practice. We split the analysis of the algorithms into 2-D and 3-D images, as described in the following sections.

2.1 2-D algorithms

In medical practice, 2-D algorithms can be used on a single image slice or on images extracted from 3-D volumes. For this research we chose image rotation, the Gaussian filter, the Sobel filter, the fast Walsh transform, the Farneback method and the Horn-Schunck optical flow.

Image rotation is a geometric operation which maps each pixel of an input image onto a position in the output image by rotating the image by a specified angle about an origin. Rotation is a special case of an affine transformation and is widely used in image processing, for example in image registration.

Figure 2: Image used for all 2-D experiments.

The Gaussian filter is the most commonly used filter and has significant usage in medical applications (for example in image registration, where it acts as a smoother). The Gaussian filter was evaluated on an input image of 2048x2048 pixels with a sigma of 10 pixels and a kernel size of 81. For the purpose of these experiments, the Gaussian kernel was implemented on both the CPU and the GPU; on the CPU, we used up to 4 threads with standard CPU optimizations. The image shown in Figure 2 was used for these experiments.
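Separability is what makes such large Gaussian kernels affordable: an 81x81 kernel applied directly costs 6561 multiplications per pixel, whereas two 1-D passes cost 2 x 81 = 162. The toy NumPy sketch below illustrates the idea; it is not the CPU or GPU implementation evaluated in this work, and the image is random data.

```python
import numpy as np

def gaussian_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

sigma, radius = 10.0, 40          # kernel size 81, as in the 2-D experiment
g = gaussian_1d(sigma, radius)

# The 2-D Gaussian kernel is the outer product of two 1-D kernels,
# so the 2-D convolution can be split into a row pass and a column pass.
image = np.random.rand(512, 512)
rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, image)
smoothed = np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)
print(smoothed.shape)
```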
The fast Walsh (or Hadamard) transform is a special case of the generalized Fourier transform; it has the same complexity as the Fourier transform but requires no multiplications.

The Farneback method for the computation of optical flow was presented in [14]. Optical flow is used to find the relative motion between two images; it can, for example, be used to recover the motion between two organs. The method approximates the neighborhood of each pixel in the two frames by quadratic polynomials, using the polynomial expansion transform (the images used are shown in Figure 3 and Figure 4). The obtained deformation fields are shown in Figure 5 and Figure 6, where the colors correspond to different deformation values. The two deformations appear similar, but a significant difference in values can be seen in the lower and right parts of Figure 5. If we take the CPU implementation as the ground truth, the difference between these two results originates from the loss of computation precision on the GPU.

The Horn-Schunck optical flow method is a classical method for finding the apparent motion in images [15]. The method assumes smoothness of the flow over the whole image and tries to minimize a global energy functional which consists of two parts, an intensity term and a regularization term; it employs an iterative scheme based on the Jacobi method. For this experiment, the images shown in Figure 3 and Figure 4 were used. The deformation fields obtained after registration with this algorithm are shown in Figure 7 and Figure 8. Comparing the deformation fields obtained by the two algorithms, we found some small differences in the pixels of the GPU image (Figure 5 and Figure 6).

Figure 3: Static image used for all 3-D experiments.

The results for the described 2-D algorithms are given in Table 2 and the corresponding Figure 9. They show that almost all algorithms, with the exception of image rotation, execute faster on the GPU; depending on the algorithm, the speedups range from 10x to 84x compared to one CPU thread. Significant further improvements can be obtained at the cost of some loss of accuracy. Almost all algorithms can run in real time on the GPU, and just one on the CPU.

Figure 4: Moving image used for all 3-D experiments.

The Sobel operator is a widely used filter for edge detection. In 2-D the Sobel operator is 3x3 for each dimension, whereas in 3-D it is 3x3x3 for each of the three dimensions; its result is a gradient vector. The filter is separable, so it can be written as the product of two simpler filters. For this experiment, we used up to four CPU threads for the evaluation. The speedups are given in Table 3 and Figure 10: we can expect a speedup of 38 times over four CPU cores, and up to 141 times compared to one CPU core. These results clearly show that the Sobel algorithm benefits significantly from a GPU implementation compared to one CPU thread.

2.2 3-D algorithms

3-D algorithms are very important in medical practice. Most medical images are 3-D volumes and need to be preprocessed, analyzed or visualized in some way. In the following, five widely used 3-D algorithms were evaluated.

For the purpose of this evaluation, we implemented a 3-D Gaussian filter in the C programming language. The volume dimensions of the tested images were 482x360x141, with a kernel size of 5 and a sigma of 0.5 voxels. The Sobel operator, which is 3x3 in 2-D, becomes 3x3x3 for each of the three dimensions in 3-D, and its result is a gradient vector. The filter is separable, so it can be written as the product of two simpler filters, which reduces the computation time. For this experiment, the same volume was used as in the previous experiment. All 3-D registrations were evaluated for three resolution levels, with a maximum of 30, 50 and 50 iterations, respectively. Threading with CUDA, OpenMP, and a single thread was used.
For the registration experiments, the B-spline, Demons and affine algorithms from Plastimatch were used. Registration using B-splines falls into the category of Free-Form Deformations (FFD), in which the object to be registered is embedded into a B-spline object [19]; the deformation of the B-spline object represents the transformation of the registration [17]. Affine image registration falls into the category of linear registration, which is a composition of linear transformations with translations. This category includes rigid transformations (translations plus rotations), rigid plus scaling, and affine transformations; another category, non-linear registration, covers non-rigid, deformable, fluid, elastic and similar methods. An affine transformation preserves points, straight lines, and planes; after the transformation, a set of parallel lines remains parallel. Affine transformations comprise translation, scaling, shear, and rotation.

The deformations obtained with the Horn-Schunck algorithm are shown in Figure 7 and Figure 8, where the colors correspond to the different deformation values obtained by the algorithm running on the GPU and the CPU, respectively. In contrast to the 2-D Farneback method, only some small differences can be spotted between the two deformation fields, which corresponds to a very small error for the GPU.

All results obtained with the 3-D registration are shown in Table 4 and Figure 11. For these experiments, OpenMP was used with four CPU threads, except for the filtering algorithms (Gaussian and Sobel). The obtained speedups range from 1x to 422x, depending on the algorithm. The lowest speedup is for the affine registration, where the CPU version of the algorithm is slightly faster; the highest speedup is for filtering, from 127x to 422x compared with a single CPU thread. The Demons and affine registration algorithms gain little or nothing from using 4 CPU threads, whereas for B-spline the GPU speedup is significantly lower when compared against the 4-thread CPU version. The image registration algorithms are highly compute-intensive, and the obtained speedup ranges from about 1x for affine to 15x for the Demons algorithm. It is worth mentioning that the Demons algorithm uses a Gaussian filter in each iteration to smooth the deformation field. From the running times for Demons, one can see that the CPU running time is almost the same for 1 and 4 threads, which indicates a single-threaded implementation of this algorithm. Similar to the 2-D implementations, there is a trade-off between precision and running time.

Figure 9: Speedups for 2-D experiments.

Table 2: Running times and corresponding GPU speedups for 2-D experiments.

Algorithm                      | GPU (s) | CPU (s) | Speedup (times)
Image rotation                 | 0.0090  | 0.10    | 0.01
Fast Walsh transform           | 0.0399  | 3.38    | 84
Farneback optical flow [14]    | 0.0116  | 0.50    | 43
Horn-Schunck optical flow [15] | 1.4200  | 13.69   | 10

Table 3: Running times (in seconds) and corresponding GPU speedups for the 2-D Sobel filter.

Algorithm    | GPU (s) | CPU (s), 1/2/3/4 threads | Speedup vs 1/2/3/4 threads
Sobel filter | 0.0155  | 2.2 / 1.1 / 0.8 / 0.6    | 141 / 71 / 51 / 38

3 Conclusions

In this paper an evaluation of the speed gain of modern GPU cards compared to a standard CPU was presented. In total, nine commonly used algorithms were evaluated on different processors using parallel processing for 2-D and 3-D images. For the CPU, up to 4 threads were used, depending on the algorithm implementation. For the GPU, the algorithms were used in a simple naive implementation, without optimizations, using all available cores. In almost all cases the processing times decrease, because the algorithms are highly parallelizable.
The obtained speedups varied from 1.1x to 422x, depending on the algorithm. Some of the tested algorithms were not well suited to a parallel implementation, i.e. their running times increased with a larger number of threads. The results obtained on the GPU suffer a small loss of accuracy, and show near real-time performance. Future work can evaluate specific optimizations for the CPU and the GPU, such as the SSE and AVX instructions for the CPU; for the GPU, native instructions, instructions with lower precision, and the determination of the optimal local and global block sizes for CUDA and OpenCL can be analyzed. Another possibility for detecting and reducing the bottlenecks of the GPU implementation is the use of a GPU profiler.

Figure 11: Speedups for 3-D algorithms.

Table 4: Running times (in seconds) and corresponding GPU speedups for 3-D algorithms.

Algorithm                  | GPU (s) | CPU, 1 thread (s) | CPU, 4 threads (s) | Speedup vs 1 thread | Speedup vs 4 threads
Sobel filter               | 0.0557  | 23.5              | -                  | 422                 | -
Gaussian filter            | 0.7860  | 100.1             | -                  | 127                 | -
B-spline registration [17] | 41.4000 | 323.4             | 99.6               | 8                   | 2
Demons registration [18]   | 6.5100  | 98.9              | 99.0               | 15                  | 15
Affine registration        | 74.7100 | 69.2              | 81.2               | 0.92                | 1.1

References

[1] Xue X, Cheryauka A, Tubbs D. Acceleration of fluoro-CT reconstruction for a mobile C-Arm on GPU and FPGA hardware: a simulation study. Proc. SPIE 6142, Medical Imaging 2006: Physics of Medical Imaging, 61424L (2 March 2006); 2006.

[2] Pratx G, Xing L. GPU computing in medical physics: A review. Medical Physics. 2011; 38(5): p. 2685-2697.

[3] Khronos. OpenCL. [Online]. [cited 2016 03 21]. Available from: https://www.khronos.org/opencl/.

[4] NVidia. NVIDIA CUDA. [Online]; 2016 [cited 2016 03 15]. Available from: http://www.nvidia.com/object/cuda_home_new.html.

[5] Couturier R. Designing Scientific Applications on GPUs: Chapman & Hall CRC; 2013.

[6] Smistad E, Falch TL, Bozorgi M, Elster AC, Lindseth F. Medical image segmentation on GPUs – A comprehensive review. Medical Image Analysis. 2015; 20(1): p. 1-18.

[7] Eklund A, Dufort P, Forsberg D, LaConte SM. Medical image processing on the GPU – past, present and future. Medical Image Analysis. 2013; 17(8).

[8] Smistad E, Elster AC, Lindseth F. Fast surface extraction and visualization of medical images using OpenCL and GPUs. The Joint Workshop on High Performance and Distributed Computing for Medical Imaging; 2011.

[9] Shackleford J, Kandasamy N, Sharp G. High Performance Deformable Image Registration Algorithms for Manycore Processors. 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2013.

[10] Samant P, Muyan-Ozcelik, Owens JD, Xia J, S. S. Fast Deformable Registration on the GPU: A CUDA Implementation of Demons. In proceedings of the 1st technical session on UnConventional High Performance Computing (UCHPC) in conjunction with the 6th International Conference on Computational Science and Its Applications (ICCSA); 2008; Perugia, Italy. p. 223-233.

[11] Plastimatch. Plastimatch. [Online]; 2016 [cited 2016 04 20]. Available from: http://plastimatch.org/.

[12] OpenMP. [Online]; 2016 [cited 2016 04 02]. Available from: http://openmp.org/wp/.

[13] OpenCV. OpenCV. [Online]; 2016 [cited 2016 04 15]. Available from: http://opencv.org/.

[14] Farnebäck G. In Bigun J, Gustavsson T, editors. Image Analysis: 13th Scandinavian Conference, SCIA 2003, Halmstad, Sweden, June 29 – July 2, 2003, Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg; 2003. p. 363-370.

[15] Horn BKP, Schunck BG. Determining Optical Flow. Tech. rep. Cambridge, MA, USA; 1980.
[16] Maintz JBA, Viergever MA. A Survey of Medical Image Registration. Medical Image Analysis, Volume 2, Issue 1, 1-36; 1998.

[17] Pennec X, Cachier P, Ayache N. Understanding the demon's algorithm: 3D non-rigid registration by gradient descent. In: Proc. MICCAI'99; 1999.

[18] Thirion JP. Image matching as a diffusion process: an analogy with Maxwell's demons. Medical Image Analysis. 1998 Sep; 2(3): p. 243-260.

[19] Tustison NJ, Avants BA, Gee JC. Improved FFD B-Spline Image Registration. Computer Vision, IEEE International Conference on. 2007; 0: p. 1-8.

[20] Siegl C, Hofmann HG, Keck B, Prmmer M, Hornegger J. OpenCL: a viable solution for high-performance medical image reconstruction? Proceedings of SPIE (Medical Imaging 2011: Physics of Medical Imaging), Lake Buena Vista, Florida, USA, 12-17 Feb 2011, vol. 7961, pp. 79612Q; 2011.

[21] Scherl H, Keck B, Kowarschik M, Hornegger J. Fast GPU-Based CT Reconstruction using the Common Unified Device Architecture (CUDA). In Nuclear Science Symposium Conference Record, 2007. NSS '07. IEEE; 2007 Oct. p. 4464-4466.

A Modification of the Lasso Method by Using the Bahadur Representation for the Genome-Wide Association Study

Lev V. Utkin
Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
E-mail: lev.utkin@gmail.com

Yulia A. Zhuk
ITMO University, St. Petersburg, Russia
E-mail: zhuk_yua@mail.ru

Keywords: data analysis, feature selection, Lasso, Bahadur representation, genome-wide association study

Received: September 24, 2016

A modification of the Lasso method, a powerful machine learning tool, applied to a genome-wide association study is proposed in the paper. From the machine learning point of view, a feature selection problem is solved, where the features are single nucleotide polymorphisms or DNA-markers whose association with a quantitative trait is established. The main idea underlying the modification is to take into account correlations between DNA-markers and peculiarities of the phenotype values by using the Bahadur representation of joint probabilities of binary random variables. Interactions of DNA-markers, called epistasis, are also considered in the framework of the proposed modification. Various numerical experiments with real datasets illustrate the proposed modification.

Povzetek: Predstavljena je modifikacija metode strojnega učenja, imenovana Lasso.

1 Introduction

One of the important areas for a successful application of artificial intelligence, and in particular of machine learning algorithms, is computational biology, which can be regarded as a basis for many engineering problems in biotechnology. An interesting task that clearly illustrates the application of artificial intelligence to biotechnology problems is the genome-wide association study (GWAS). GWAS examines the association between phenotypes or quantitative traits and genetic variants or genotypes across the entire genome. In the machine learning framework, it can be regarded as a feature selection problem where the features are the so-called single nucleotide polymorphisms (SNPs) or DNA-markers. As pointed out in [12, 15], this feature selection problem poses some difficulties. First of all, the number of SNPs p is typically 10–100 times the number of individuals n in the training sample. This is the so-called p > n (or "large p, small n") problem, which leads to an oversaturated model. Another difficulty is that SNPs may affect the phenotype in a complicated and unknown manner. For example, some DNA-markers may interact in their effects on the phenotype.
This interaction is called the epistatic effect.

A huge number of statistical models and methods for solving the SNP selection problem have been developed over the last decades. Part of these methods can be referred to as filter methods [1, 28], which use statistical properties of SNPs to filter out poorly informative ones. A review of filter methods in GWAS is given by Zhang et al. [53]. The t-test, the Fisher criterion (F-statistics), the χ²-statistics and ANOVA tests are well-known statistical methods for detecting differential SNPs between two samples in training data. Another family of methods, called wrapper methods, generally provides more accurate solutions than the filter methods, but is computationally demanding [24]. One of the well-known wrapper methods, proposed by Guyon et al. [17] and called Recursive Feature Elimination, has been applied to the gene selection problem for cancer classification.

Filter methods and their modifications, as well as wrapper methods, can be efficient tools for solving GWAS problems. At the same time, many feature selection methods use regression models. One of the pioneering and most well-known papers devoted to the use of regression models in SNP selection was written by Lander and Botstein [27]. Methods for constructing the corresponding regression models can be referred to as embedded methods [25]. They perform feature selection in the process of model building and cover many well-known approaches, including the Ridge regression and the Least Absolute Shrinkage and Selection Operator (Lasso) techniques [41], which are among the most popular and efficient tools for SNP selection problems. The main advantage of the Lasso method is that it performs variable selection and classification or regression simultaneously. A lot of approaches using the Lasso method and its modifications have been developed for solving the SNP selection problem in the framework of GWAS [13, 31, 35, 36, 40, 43]. Hayes [18] provided a comprehensive overview of statistical methods for GWAS in animals, plants, and humans. Various approaches to SNP selection with the Lasso algorithm and other methods can also be found in [16, 22, 33, 46].

The main aim of GWAS is to identify SNPs that are directly associated with a trait, i.e., the standard GWAS analyzes each SNP separately in order to identify a set of significant SNPs showing genetic variations associated with the trait. However, an important challenge in the analysis of genome-wide data sets is taking into account the so-called epistatic effect, when different SNPs interact in their association with the phenotype. Campos et al. [12] explain some shortcomings of the standard GWAS. They write that the currently identified SNPs might not fully describe genetic diversity. For instance, these SNPs may not capture some forms of genetic variability that are due to copy number variation. Moreover, genetic mechanisms might involve complex interactions among genes and between genes and environmental conditions, or epigenetic mechanisms, which are not fully captured by additive models. Many statistical approaches make sense under the assumption that only a few genes affect genetic predisposition. However, GWAS may be unsatisfactory for many important traits which may be affected by a large number of small-effect, possibly interacting, genes. Limitations and pitfalls of prediction analysis in the framework of GWAS have been discussed in detail by Wray et al. [47], where it is shown how naive implementations can lead to severe bias and misinterpretation of results.
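Statistical epistasis, which the next paragraphs discuss in more detail, is commonly operationalized as a significant interaction term in a regression of the phenotype on a pair of SNPs. The toy sketch below (simulated data; this is not the method proposed in this paper) tests such an interaction with an ordinary F-test comparing an additive model with a model that also contains the product of the two SNPs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
snp1 = rng.integers(0, 2, n).astype(float)
snp2 = rng.integers(0, 2, n).astype(float)
# Toy phenotype driven mainly by the product (interaction) of the two SNPs.
y = 2.0 * snp1 * snp2 + rng.standard_normal(n)

def rss(design, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ coef) ** 2)

ones = np.ones(n)
additive = np.column_stack([ones, snp1, snp2])            # marginal effects only
full = np.column_stack([ones, snp1, snp2, snp1 * snp2])   # plus the interaction

# F-test for the single interaction term: statistical epistasis shows up as a
# significant drop in the residual sum of squares.
df = n - full.shape[1]
f_stat = (rss(additive, y) - rss(full, y)) / (rss(full, y) / df)
p_value = stats.f.sf(f_stat, 1, df)
print(f"interaction F = {f_stat:.1f}, p = {p_value:.2e}")
```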
Infact, the epistatic effect can be viewed as gene-gene interaction when the action of one locus depends on the genotype of another locus. At the same time, there are dif­ferent interpretationsof the epistaticeffect.Afundamental critical review of different defnitions and interpretations of epistasisisprovidedby Cordell[11] whereitis pointed out that there are manyconficting defnitions of epistasis, which lead to certain problems in interpretations, namely, the statistical interaction may not correspond to the biolo­gical modelsofepistasis.As indicatedbyWanetal. [44], there are mainlythreedifferent defnitionsof gene-genein­teractions: functional, compositional and statistical epista-sis. We consider only the statistical epistasis which can be regarded as the statistical deviation from the joined effects of twoSNPs on the phenotype. At that, the individual SNPs may exhibit no marginal effects. Alot of methods dealing with epistasis effect have been developed last decades [3, 30, 52, 49, 50, 51, 54]. Compre­hensive and interesting reviews of methods detecting inte­ractingthe epistaticeffect wereprovidedbyseveral authors [7, 45]). Analyzing various modifcations of the Lasso method applied to the GWAS problems, we can point out that many effcient modifcations are based on applying spe­cial forms of the penalty function, which take into account some additional information about SNP markers and the corresponding phenotype values. Some interesting algo­rithms [33, 42] devoted to various penalty functions will be studied in the next section. The use of a specifc addi- L.V. Utkin et al. tional information allows us to improve the GWAS and is considered in the paper. In the present study, we modify the Lasso methodby ta­king into account some peculiarities of the double haploid (DH) lines of barleywhich are very important in the plant biotechnology. According to the DH method, only two ty­pes of genotypes occur for a pair of alleles. From a sta­tistical point of view, we solve a linear regression problem with binary explanatory variables. Our method is based on the well-known adaptive Lasso [56] and takes into ac­count additional information about the correlation between SNPs, frequencies of alleles and expected phenotype va­lues. We propose to use the Bahadur representation [2] by partially applying the ideas provided by Lee and Jun [29] where the authors propose to apply the Bahadur representa­tion to classifcation problems. The Bahadur representation allows us to compute joint probabilities of SNPs by taking into account the correlation between binary random varia­bles. That is another reason whywe analyze only DH lines which produce the binary genotypes. In order to modify the adaptive Lasso, we propose to assign penalty weights in accordance with expected values of the phenotype with respect to a probability mass function somehow defned on the genotype values. In other words, computing the ex­pected values of the phenotype in a special way is a main idea of the proposed method. We show that the proposed modifcation is directly extended on the case taking into account the epistatic effect. 2 The Lasso method We analyzen double haploid (DH) lines of barley or a po­pulation from n individuals. From a statistical point of view, marker genotypes can be treated as qualitative expla­natory variables, i.e., Xj =(x1j , ..., xnj)T is a predictor representing the j-th SNP, j =1, ..., p. Here xij isa binary variable, i.e., xij .{0, 1}. Aquantitative trait of interest orasetofthe phenotypevalues yi . 
R, i = 1, ..., n, can be regarded as the response vector Y = (y_1, ..., y_n)^T. We also denote by X = [X_1, ..., X_p] the genotype matrix for the n lines or individuals (the predictor matrix in terms of statistics); x_i = (x_{i1}, ..., x_{ip})^T is the vector of alleles corresponding to the i-th line, i = 1, ..., n. First, we focus on the standard linear regression model

y = Σ_{i=1}^{p} β_i X_i + β_0 + ε = Xβ + β_0 + ε. (1)

Here ε is a noise variable with zero expectation; β_i is the SNP effect, β = (β_1, ..., β_p). Without loss of generality, we assume that the predictors and the response are centered and that the predictors are standardized, that is,

Σ_{i=1}^{n} y_i = 0, Σ_{i=1}^{n} x_{ij} = 0, Σ_{i=1}^{n} x_{ij}^2 = 1, j = 1, ..., p.

This implies that the intercept is not included in the regression function.
The Lasso is a regularization technique for simultaneous estimation and variable selection [41]. The Lasso estimates are defined from the following quadratic programming problem:

β̂ = arg min_{β ∈ R^p} ||Y − Xβ||^2, subject to Σ_{j=1}^{p} |β_j| ≤ s,

for some s > 0. The Lagrange formulation is

β̂ = arg min_{β ∈ R^p} ||Y − Xβ||^2 + λ Σ_{j=1}^{p} |β_j|,

where λ is a nonnegative regularization parameter. The second term is the L1 penalty, which is crucial for the success of the Lasso. The Lasso estimator is usually calculated on a grid of values of the tuning parameter λ, and a cross-validation procedure is subsequently used to select an appropriate value of λ.
The Lasso penalizes the regression coefficients by their L1 norm. However, in order to improve the performance of the Lasso, the regression coefficients can be penalized individually. As a result, we write the weighted Lasso estimates as follows:

β̂ = arg min_{β ∈ R^p} ||Y − Xβ||^2 + λ Σ_{j=1}^{p} w_j |β_j|,

where w_j > 0, j = 1, ..., p, are weights determined a priori in accordance with some rules. A larger weight w_j corresponds to a higher penalty and discourages the j-th predictor from entering the model. Conversely, a smaller weight w_j exerts less penalty and encourages selection of the corresponding predictor [55].
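To make the formulation above concrete, the following small sketch fits the standard Lasso and selects λ on a grid by cross-validation, as described in the text, using scikit-learn. The data are randomly generated stand-ins for a binary genotype matrix and a centred phenotype vector, and the variable names are ours; note that scikit-learn scales the squared-error term by 1/(2n), so its alpha corresponds to λ only up to that factor.

# A minimal sketch of the standard Lasso estimate described above, with
# randomly generated stand-ins for a binary genotype matrix X and a
# centred phenotype vector Y.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 500                                       # "large p, small n" setting
X = rng.integers(0, 2, size=(n, p)).astype(float)     # binary DH genotypes
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.7]           # a few causal SNPs
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Centre the response and centre/standardise the predictors, as assumed in (1).
Y = Y - Y.mean()
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc ** 2).sum(axis=0) + 1e-12)

# Grid of tuning parameters with cross-validation, as described in the text.
model = LassoCV(cv=5).fit(X, Y)
selected = np.flatnonzero(model.coef_ != 0)
print("lambda chosen by CV:", model.alpha_)
print("selected SNP indices:", selected[:10])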
The penalized Lasso can be reformulated as the standard Lasso problem [6]. If we introduce new covariates and regression coefficients as x̃_ij = x_ij / w_j, i = 1, ..., n, and β̃_j = β_j w_j, then the weighted Lasso problem can be rewritten as follows:

β̃̂ = arg min_{β̃ ∈ R^p} ||Y − X̃β̃||^2 + λ Σ_{j=1}^{p} |β̃_j|,

where β̃ and X̃ are the vector and the matrix with elements β̃_j and x̃_ij, respectively.
Zou [56] proposed one of the methods for determining the weights w_j, namely w_j = 1/|β_init,j|, where β_init,j is a prior estimator of β_j, for example, the least squares estimator. The corresponding Lasso problem is referred to as the adaptive Lasso, and it has many nice properties improving the performance of the Lasso. Moreover, it can be a basis for constructing the boosting Lasso [6].
The Lasso has many interesting properties which make the method very popular. At the same time, Zou and Hastie [57] point out that, in spite of the success of the Lasso, it has some limitations; in particular, if there is a group of variables among which the pairwise correlations are very high, then the Lasso tends to select only one variable from the group and does not care which one is selected. In other words, the Lasso tends to put all the weight on the selected variable. On the one hand, this is a shortcoming, and many methods have been proposed to overcome this obstacle, for example, the so-called elastic net [57], where the estimates are defined by

β̂ = arg min_{β ∈ R^p} ||Y − Xβ||^2 + λ_1 Σ_{j=1}^{p} |β_j| + λ_2 Σ_{j=1}^{p} β_j^2.

However, the elastic net requires an additional parameter λ_2 whose value is a priori unknown. On the other hand, in contrast to the ridge regression, which tends to select all of the correlated variables and make the corresponding coefficients equal, the Lasso selects a group of correlated variables and "isolates" it.
A special choice of the penalty term, on the basis of some prior information about the SNPs or about the exploited genome selection model itself, may lead to a series of useful or important properties of the regression or classification model corresponding to the Lasso. Liu et al. [33] tried to exploit the observed fact that there exists a natural grouping structure in SNPs and, more importantly, that such groups are correlated. The authors proposed a new penalization method for group variable selection which can properly accommodate the correlation between adjacent groups. Their method, referred to as the smoothed group Lasso, is based on a combination of the group Lasso penalty and a quadratic penalty on the difference of regression coefficients of adjacent groups. Liu et al. [33] assume that the SNPs are divided into J groups, each of size d_j, j = 1, ..., J, according to their physical locations and correlation patterns. As a result, the vector β = (β_1, ..., β_J) is defined for groups of SNPs, and not for separate SNPs; β_j is the vector of parameters corresponding to the SNPs from the j-th group. The authors consider the quadratic loss function of the form

l(β) = ||Y − Σ_{j=1}^{J} X_j β_j||^2.

Here X_j is an n × d_j matrix corresponding to the j-th group [33]. There are two main difficulties in using this method. First, it is rather hard from the computational point of view. Second, the grouping structure of the SNPs has to be known a priori.
An interesting approach for dealing with correlated covariates was proposed by Tutz and Ulbricht [42]. Their method utilizes the correlation between predictors explicitly in the penalty term. Coefficients which correspond to pairs of covariates are weighted according to their marginal correlation. The correlation-based penalty is given by

Q_λ(β) = λ Σ_{i=1}^{p−1} Σ_{j>i} [ (β_i − β_j)^2 / (1 − ρ_ij) + (β_i + β_j)^2 / (1 + ρ_ij) ].

Here ρ_ij denotes the empirical correlation between the i-th and the j-th predictors. If there is a positive correlation, i.e., ρ_ij → 1, then the first term in the sum becomes dominant. When ρ_ij → −1, the second term becomes dominant. Both these cases lead to the approximate equality β_i ≈ β_j. In the case of uncorrelated predictors, ρ_ij → 0, the corresponding model reduces to the ridge regression.
Another model, proposed by Park and Hastie [38], constructs sets of indicators representing all the available factors and all possible two-way interactions in order to fit gene-interaction models to data consisting of genotype measurements and a binary response. The obtained grouped variables are used in the path-following algorithm for the group-Lasso method.
In order to take into account different probabilities of feature values, in particular the allele frequency, Zhou et al. [55] proposed a weighted Lasso penalty such that the weights are assigned in accordance with the following sources of prior knowledge. First, Zhou et al. [55] considered genotyping errors, so that unreliable variants should be penalized more. Second, they pointed out that the allele frequencies can be used in accordance with an idea of Madsen and Browning [34], where it was proposed to take the weight w = √(2π(1 − π)) for a variant with population frequency π
by arguing that this scheme assigns smaller penalties to rarervariants as suggestedby classical population genetics theory. 3 The proposed method 3.1 Motivationfora new penalized method The considered in previous sections modifcations of the Lasso areeffcient toolsfor solvingtheGWASandSNP se­lection problems. Their performance has been experimen­tally shown by manyauthors [13, 22, 23, 31, 33]. Howe­ver, every real application problem has some peculiarities whose accounting might improve the regression method. Let us mention these peculiarities. 1. First of all, our aim is not to fnd the “best” regression model for the given information, but to select SNPs which impact on the smallest (largest) values of the phenotype, forexample, on the heading date early fo­weringofbarleyinthestudied applications.Thisdoes not mean that the whole ftted regression model is not importantforus.Wehaveto combinetwoaboveaims. This can be done by introducing the weighted Lasso penalties of a special form. This form has to take into L.V. Utkin et al. account in the frst place the smallest values of phe­notype. Smaller values of the phenotype produce lar­ger weights, whereas larger values of the phenotype should be also considered. It would seem that we can assignthe weightsto phenotypevalues with respectto their closeness to the minimal phenotype value. Ho­wever, the phenotype values are random and depend on environment conditions. Moreover, the smallest phenotype value does not mean that its value is cau­sed by the corresponding genotype. This implies that we cannot assign weights to the available phenotype values. The main idea underlying the method is to as­sign weights to expected values of the phenotype with respect to a probability mass function somehow def­ned on thegenotype values. 2. The genotype values corresponding to every SNP in the studied application make up a binary vector. The dependence of SNPs leads to dependence of the cor­responding binary vectors which can be estimated. 3. The allele frequencies and correlations indirectly im­pact on the smallest values of the phenotype. 3.2 Amethodfor computing weightsfor the Lasso By extending the ideas proposed by aforementioned aut­hors [34, 42, 55], we defne the weighted Lasso penalty in a new way. The main idea is the following. We defne the average contribution of every SNP to the mean phe­notype value. These contributions or their function are no­thing elsebut the weights wk in the adaptive Lasso. They have to take into account the probabilities of alleles, the correlations between SNPs and the phenotype values. The next question is how to determine the average contribution of every SNP. It can be carried out as follows: 1. For every genotype vector xj (the j-thline), we com­pute joint probabilities .(xkj,xij) of all pairs (k, i) of SNPs by taking into account correlations between pairs of random variables (SNPs). 2.Forevery pair (k, i), we compute the mean phenotype value Rki as the expectation of phenotypes with re­spect to the joint probabilities .(xkj ,xij ) over all li­nes or individuals. 3. The average contribution of every, say k-th, SNP into the phenotype is computed by averaging the mean phenotype values Rki over all i =1, ..., p. 4. The weights or their function for the adaptive Lasso are defnedby theaverage contributions. Below we consider every step in detail. 
3.3 Bahadur representation
The main idea of using the joint probability π(x_jk, x_ji) is to take into account the correlation between the SNPs with indices k and i. For every pair of SNPs X_k and X_i, we have to determine the joint probability π(x_kj, x_ij), i = 1, ..., p, i ≠ k, of the j-th individual. It can be computed by using the so-called Bahadur representation proposed by Bahadur [2]. The Bahadur representation takes into account the correlation between binary variables, and in the case of two binary variables with numbers k and i it can be written as

π(x_k, x_i) = π_k^{x_k} (1 − π_k)^{1−x_k} · π_i^{x_i} (1 − π_i)^{1−x_i} · (1 + ρ_ki u_k u_i). (2)

Here π_k is the probability of an allele for the k-th SNP, or its allele frequency, i.e., π_k = Pr{x_k = 1}; ρ_ki is the correlation coefficient between the k-th and the i-th SNPs, defined as ρ_ki = E[U_k U_i], where the standardized random variable U_k takes the values u_k such that

U_k = (X_k − π_k) / √(π_k(1 − π_k)), u_k = (x_k − π_k) / √(π_k(1 − π_k)).

Note that the first term on the right-hand side of the expression for π(x_k, x_i) represents the joint probability mass function under the condition that the variables X_k and X_i are statistically independent. The second term includes the interactions from the first order up to the second. Note also that U_k should be evaluated by estimating π_k. The corresponding estimates of the parameters π_k, u_k, ρ_ki, denoted π̂_k, û_k, ρ̂_ki, are computed by means of the following expressions:

π̂_k = Σ_{l=1}^{n} x_kl / n, ρ̂_ki = Σ_{l=1}^{n} û_kl û_il / n,

where n is the number of individuals and

û_kl = (x_kl − π̂_k) / √(π̂_k(1 − π̂_k))

is the l-th observed value of the variable U_k.
It should be noted that the Bahadur representation can also be written for joint probabilities of three, four, etc. variables. [32] mention a property of the Bahadur representation such that the joint probability distribution of any subset x_1, x_2, ..., x_t can be written as follows:

π(x_1, ..., x_t) = Π_{i=1}^{t} π_i^{x_i} (1 − π_i)^{1−x_i} · [ 1 + Σ_{Q ⊆ {1,...,t}, |Q| ≥ 2} ρ_Q Π_{k ∈ Q} u_k ].

Here ρ_Q represents ρ_{i_1,...,i_k} if Q = {i_1, ..., i_k}, and |Q| denotes the number of elements in Q. The main disadvantage of the Bahadur representation is the large number of parameters and the hard computations required for obtaining the probabilities. Therefore, we restrict our study to probabilities of two variables only.
It should be noted that the Bahadur representation has been used in some classification models. One of the interesting models for discriminant analysis of binary data was proposed by Lee and Jun [29]. The main contribution of [29] is that they proposed to take into account the correlation between variables or, more exactly, estimates of the correlation, by means of the Bahadur representation.
There are pros and cons of using this model when the number of variables is larger than the number of observations. For example, Bickel and Levina [4] suppose that classification rules ignoring the correlation structure often perform better in this case. However, Lee and Jun [29] show by means of various experimental studies that the correlation should be taken into account in all cases.
In spite of the arguments of [29] in defense of the correlation analysis for high-dimensional data, there is a risk of incorrect estimates of interactions of large order. Moreover, it is practically impossible to compute the corresponding joint probabilities when the number of SNPs is rather large. Therefore, we propose an approach which partially uses joint probabilities of variables and partially takes into account the correlation between the variables.
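As an illustration of the estimates above, the sketch below computes π̂_k, the standardized scores û_kl and the pairwise correlations ρ̂_ki from a binary genotype matrix, and evaluates the second-order Bahadur probability (2) for a pair of SNPs. The function names are ours, not the authors'; note also that the truncated expansion can leave the interval [0, 1] when the correlations are strong, which is one reason the paper restricts itself to pairs of variables.

import numpy as np

def bahadur_pairwise(X):
    """Estimate allele frequencies, standardised scores and pairwise
    correlations needed for the Bahadur representation (2).
    X is an (n x p) binary genotype matrix."""
    n, p = X.shape
    pi = X.mean(axis=0)                              # pi_k = Pr{x_k = 1}
    U = (X - pi) / np.sqrt(pi * (1 - pi) + 1e-12)    # standardised values u_kl
    rho = (U.T @ U) / n                              # rho_ki = E[U_k U_i]
    return pi, U, rho

def joint_prob(x_k, x_i, pi_k, pi_i, rho_ki):
    """Pairwise Bahadur probability pi(x_k, x_i) for binary values x_k, x_i."""
    marg = (pi_k ** x_k * (1 - pi_k) ** (1 - x_k)) * (pi_i ** x_i * (1 - pi_i) ** (1 - x_i))
    u_k = (x_k - pi_k) / np.sqrt(pi_k * (1 - pi_k))
    u_i = (x_i - pi_i) / np.sqrt(pi_i * (1 - pi_i))
    return marg * (1 + rho_ki * u_k * u_i)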
3.4 Average contributions of SNPs
In order to determine the average contribution of the k-th SNP to the mean value of the phenotype, we consider all possible pairs of SNPs such that one of the SNPs in every pair is the k-th SNP, i.e., we are interested in considering the p − 1 pairs of SNPs with numbers (k, 1), ..., (k, k − 1), (k, k + 1), ..., (k, p). Every pair, say (k, i), determines a mean phenotype value R_ki corresponding to this pair of SNPs as follows:

R_ki = Σ_{j=1}^{n} π(x_kj, x_ij) y_j / Σ_{j=1}^{n} π(x_kj, x_ij). (3)

In other words, we compute the expected phenotype value under the condition that every phenotype value y_j is produced by the subset of genotypes corresponding to the k-th and the i-th SNPs. The measure R_ki can be regarded as a contribution of the k-th and the i-th SNPs to the mean phenotype value.
Then the contribution of the k-th SNP to the mean phenotype value, denoted R̃_k, can be determined by averaging the measures R_ki, i.e., it is computed as

R̃_k = (1/(p − 1)) Σ_{i=1, i≠k}^{p} R_ki. (4)

It is obvious that the smaller values of the measure R̃_k give us significant or top-ranked SNPs and should exert less penalty w_k, i.e., we can introduce an increasing function g such that w_k = g(1/|β_init,k|). One of the possible functions, which will be used in the numerical experiments, is

w_k = ( (R̃_k − min_{k=1,...,p} R̃_k) / (max_{k=1,...,p} R̃_k − min_{k=1,...,p} R̃_k) )^{−q}. (5)

Here q is a positive real number which defines how changes of the difference between R̃_k and min_{k=1,...,p} R̃_k impact the changes of the weights w_k. The number q can be regarded as a tuning parameter whose optimal value can be obtained by means of the cross-validation procedure.
In sum, the obtained weights take into account the correlation between SNPs, the allele frequencies, the binary data and the fact that the smallest (largest) values of the phenotype are more important than the other values, because we are looking for the SNPs which impact the values of some trait with predefined properties; for example, the heading date of barley should be as small as possible. At the same time, we do not need to use the obtained weights directly and to implement the adaptive Lasso algorithm. As mentioned in the previous section, the adaptive Lasso can be transformed into the standard Lasso by introducing the new covariates x̃_ij = x_ij / w_j. Finally, we write the following SNP selection algorithm.

Algorithm 1: The SNP selection algorithm.
Require: Y = (y_1, ..., y_n)^T, the response vector (phenotype values); X = [X_1, ..., X_p], the binary predictor matrix (genotype values).
Ensure: β = (β_1, ..., β_p), the vector of regression coefficients (degrees of the SNP effect).
For k = 1, ..., p:
  Compute the joint probabilities π(x_jk, x_ji), i = 1, ..., p, i ≠ k, for all j = 1, ..., n, by means of the Bahadur representation (2).
  Compute the mean phenotype values R_ki, for all i = 1, ..., p, i ≠ k, by means of (3).
  Compute the average mean phenotype value R̃_k by means of (4).
  Compute the weights w_k by means of (5).
  Compute the new variables x̃_ik = x_ik / w_k, i = 1, ..., n.
Compute β̃_opt by using the standard Lasso with β̃ and X̃ instead of β and X.
Compute β_k = β̃_k / R̃_k, k = 1, ..., p.
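A compact sketch of Algorithm 1 follows, under the assumption that the hypothetical helpers bahadur_pairwise and joint_prob from the previous snippet are available: it computes the mean phenotype values R_ki of (3), the average contributions R̃_k of (4) and the weights w_k of (5), and then rescales the covariates so that the standard Lasso (e.g., LassoCV above) can be applied, as the algorithm prescribes. The small epsilon added before raising to the power −q is our addition and only guards against division by zero.

import numpy as np

def snp_weights(X, Y, q=1.0):
    """Average contributions (eqs. (3)-(4)) and adaptive-Lasso weights (eq. (5))."""
    n, p = X.shape
    pi, U, rho = bahadur_pairwise(X)      # hypothetical helper from Section 3.3
    R = np.zeros((p, p))
    for k in range(p):
        for i in range(p):
            if i == k:
                continue
            probs = joint_prob(X[:, k], X[:, i], pi[k], pi[i], rho[k, i])
            R[k, i] = np.sum(probs * Y) / np.sum(probs)          # eq. (3)
    R_tilde = R.sum(axis=1) / (p - 1)                            # eq. (4)
    norm = (R_tilde - R_tilde.min()) / (R_tilde.max() - R_tilde.min() + 1e-12)
    return (norm + 1e-12) ** (-q)                                # eq. (5)

# Rescale the covariates and run the standard Lasso, as in Algorithm 1:
# w = snp_weights(X, Y, q=1.0)
# X_tilde = X / w
# beta_tilde = LassoCV(cv=5).fit(X_tilde, Y).coef_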
Let us indicate the main virtues of the proposed method. First of all, it does not require developing special algorithms for solving the optimization problem for computing the vector of regression coefficients β: the obtained problem is solved as the standard Lasso after reformulating the penalized Lasso.
Second, the method is rather general, because we could change the weights in (5) in accordance with our goal. For example, in one of the applications we have aimed to minimize the mean heading date of barley as the mean phenotype value. However, we could aim to maximize, for example, the amount of grain protein. In this case, we change (5) by taking a decreasing function g as follows:

w_k = ( (max_{k=1,...,p} R̃_k − R̃_k) / (max_{k=1,...,p} R̃_k − min_{k=1,...,p} R̃_k) )^{−q}.

Here the larger values of the measure R̃_k give us more significant SNPs and exert less penalty w_k.
Third, we consider not only correlations between SNPs, but also joint probabilities accounting for the correlations. The joint probabilities are more informative than the correlation coefficients alone.
Fourth, we have simplified the procedures for computing the joint probabilities. This substantially reduces the computation time.

3.5 The proposed method with epistatic effect
A lot of studies devoted to the epistatic effect (see, for example, [5]) consider an extension of the so-called main effect model (1) to the interaction model, which can be written as

Y = Σ_{i=1}^{p} β_i X_i + Σ_{i<j} β_ij X_i X_j + β_0 + ε, (6)

where w_ij > 0 are weights penalizing the additional parameters β_ij.
s is the action taken by the agent, s ∈ {0, 1} (where 0 means cooperate and 1 means defect); ncp is the number of players who chose to cooperate; N is the size of the player population; and b is the defection temptation. The constant c is used to ensure that the cooperation reward is less than the defection reward.
[14] and [15] studied the emergence of cooperation in an NPPD. The authors used different agent personalities and neighbourhoods in order to investigate their impact on the evolution of the game outcome. The experiment used different agent types with different initial co-operator ratios; this showed that, for the case where all agents are Pavlovian (repeating actions that give them more satisfaction), the aggregate outcome of the game can be predicted for any number of agents and any payoff function. The choice of an agent's neighbours also has a big influence on the game equilibrium. [16] investigated the effect of social welfare preference on the emergence of cooperation among agents placed on a BA [17] network. The authors proposed a model where some of the agents also take into consideration social welfare and not only the payoff they receive from the game: agents do not only care about their own payoff, but also about the payoff of their neighbours. [18] describes the use of an NPPD game to investigate cooperation in a socio-geographic community. The use of the NPPD for environmental modelling has proved to be suitable, since the exploitation of a shared ecological resource can be formulated as a tragedy of the commons [19]: each actor tends to maximise its profits by exploiting a shared ecological resource, and thus a tragedy of the commons arises. [20] uses a PD model to review Porter's hypothesis, which studies the relationship between productivity and eco-friendly technologies. The work models how strict environmental regulations can enhance innovation towards a less polluting technology. Firms have two actions: to invest in a new, less polluting process, or to continue using the old one and be penalised according to the governmental regulations.
[21] used a version of the NPPD to investigate cooperation in international environmental negotiations to reduce CO2 emissions. [22] presents an evolutionary game theory approach to study the influence of the ecological dynamics and payoff structures on the emergence of cooperative behaviour between landowners. The landowners are modelled as selfish agents aiming to maximize their profit by managing the number of deer on their lands.
The main novelty of our approach is the inclusion of human decisions as a key element for simulating the evolution of air pollution. We model the managers of the emission sources of pollutants as autonomous agents. These agents aim to maximise their own profit, and we investigate the effect of this on air quality. The designed system helps to investigate the efficiency of the regulatory rules used by air pollution control agencies for maintaining air quality. This is very important because it helps the environmental agencies to assess their air pollution control policies.
The paper is organised as follows. The methodology is presented in section (2), which describes a MAS approach for designing an agent-based air pollution simulator. Subsection (2.1) presents the representation of space and time. Subsection (2.2) describes the dispersion model and the Artificial Neural Network (ANN) prediction model. The agent decision-making mechanism is given in subsection (2.3). A test scenario is presented in section (3). Results are detailed and discussed in section (4). The paper ends with conclusions and possible further directions of our work.

2 Model approach and architecture
Many conceptualizations have been proposed to represent a socio-environmental system [22]. Generally, a socio-environmental simulation system can be represented as an interconnection of three components (or subsystems), each one represented by a set of variables (attributes) forming its state at time t. The ecological component models the biotic (living) and abiotic (non-living) parts. The economic component represents the economic viewpoint and groups the economic variables. The social component represents the human social networks, such as decision-makers, firms, government agencies and consumers. A change in a state variable of one component affects the state variables of the other subsystems. For example, an increase in the demand for a certain kind of fish leads fishermen to intensify their exploitation; this in turn results in changes to the biodiversity.
We present a generic formalization of a socio-environmental system. A coupled social and environmental system can be expressed as a set of economic, social and ecological state variables. The state of the system at time step t can be formulated as (2), where Ec_t, Sc_t and Envc_t represent the sets of economic, social and environmental state variables, respectively:

ES_t = <Ec_t, Sc_t, Envc_t>, (2)
Ec_t = <ec_{1,t}, ..., ec_{i,t}>, Sc_t = <sc_{1,t}, ..., sc_{j,t}>, Envc_t = <envc_{1,t}, ..., envc_{k,t}>. (3)

In our case, the environment state variables at time step t are

Envc_t = <c_{0,t}, ..., c_{v,t}, WS_t, T_t, Hu_t, RF_t>, (4)

where c_{i,t} is the concentration of the pollutant i, WS is the wind speed, T the temperature, Hu the humidity and RF the rainfall, at time t. The source of pollution at time t is modelled as

S_t = <tc, er, X, Y, Z>. (5)

The source produces the pollutant tc at the rate er at the geo-position (X, Y, Z). Sources are controlled by agents. Every agent has to decide which action to choose among all possible actions, according to the state of the environment ES and its own internal state at time t. Let A be the set of actions, A = {a_1, a_2, ..., a_m}; the result of an action is a change in the emission rate of the pollutant from the controlled source.
This can be defined as a function that takes the agent's action and returns the new emission rate:

f: A → R. (6)

Let A_t be the action vector of the N agents at time t:

A_t = <a_{0,t}, ..., a_{N,t}>. (7)

Let Q be the set of possible air quality index values, Q = {very_good, good, average, bad, very_bad}; each of these indices has its numerical equivalent in terms of pollutant concentration, as shown in Table 1.

SOx (µg/m³)   NOx (µg/m³)   O3 (µg/m³)   PM10 (µg/m³)   Index   Category
0–30          0–45          0–45         0–20           1       Very Good
30–60         45–80         45–80        20–40          2       Good
60–125        80–200        80–150       40–100         3       Average
125–250       200–400       150–270      100–200        4       Bad
>250          >400          >270         >200           5       Very Bad
Table 1: Air pollution quality.

The air quality can be modelled as a graph with the transition function T:

T(ES_t, A_t, current_q) → new_q, current_q, new_q ∈ Q. (8)

T takes as arguments the state of the system ES_t and the set of actions performed by the N agents, and accordingly moves the system from the current state (current_q) to a new state (new_q). Under some conditions current_q may be equal to new_q, which means that, under certain climatic conditions, the actions of the agents do not change the air quality.
Our simulation approach can be schematised as shown in Figure 1. Agents' actions affect the emission rates of the sources. The dispersion algorithm is then used to compute the dispersion, and the aggregated value of the pollutant concentration is used together with the climatic parameters to forecast the air pollution concentration and air quality for the next 2 hours. According to these forecasts, agents are rewarded or penalised. Agents then adapt their strategies to earn more reward and reduce penalties.

2.1 The spatial and temporal scale of the simulation model
The simulation uses a discrete representation of time, where each simulation step represents by default 6 hours of real time. The duration of the simulation is defined in the interface and depends on the objective of the simulation (short- or long-term prediction). Our model is based on the hypothesis that the actions of the emission controllers (reducing or increasing emissions) have an impact within k time steps. k is a parameter whose value is provided by the user according to the scenario and the available data. Since the simulation step k can be changed, we can represent a long-term simulation horizon by giving k a higher value. For example, we can represent 1 step as 24 hours, meaning that industrial polluters can take several days to adjust their production volumes. Setting k even higher, such as 2 weeks or 1 month, would require data, which is not available, to see the evolution of the AQ.
The environment is modelled as a set of 3D boxes, each one representing one km³. It can be represented as B = {b_0, ..., b_n}; every box is localised at the geo-position point gp(x, y, z) and has attributes representing the concentrations of the air pollutants (cp_0, ..., cp_v) and the air quality. These attributes are used when agents are penalized according to the pollution level in the box where they are situated; in this case the position of the emission source within a box is relevant. Sources located in the same box are considered to be neighbours.
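Table 1 above can be turned into a simple lookup. The sketch below maps the monitored pollutant concentrations to their sub-indices and aggregates them into one overall index; taking the worst sub-index as the overall value is our assumption, since the paper only states that each index has a numerical equivalent in terms of concentration.

# Upper bounds (µg/m3) of index classes 1..4 from Table 1; anything above
# the last bound falls into class 5 (Very Bad).
BOUNDS = {
    "SOx":  [30, 60, 125, 250],
    "NOx":  [45, 80, 200, 400],
    "O3":   [45, 80, 150, 270],
    "PM10": [20, 40, 100, 200],
}

def air_quality_index(conc):
    """conc: dict pollutant -> concentration in µg/m3.
    Returns an overall index 1 (Very Good) .. 5 (Very Bad), here taken as
    the worst per-pollutant sub-index (an assumption, not stated in the paper)."""
    worst = 1
    for pollutant, value in conc.items():
        sub = 5
        for idx, bound in enumerate(BOUNDS[pollutant], start=1):
            if value <= bound:
                sub = idx
                break
        worst = max(worst, sub)
    return worst

# Example usage:
# air_quality_index({"SOx": 70.0, "NOx": 50.0, "O3": 30.0, "PM10": 25.0})  -> 3 (Average)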
2.2 Dispersion and prediction models
The dispersion model measures how the pollutant spreads in the air. It is calculated according to the distance from the point source and the wind speed and direction. We used a GPD (Gaussian Plume Dispersion) model, which is frequently used in atmospheric dispersion [24]. The dispersion model is run in a steady way, which means that no parameter (wind speed, emission rate or wind direction) is changed during the simulation step. This provides a series of snapshots of the situation at each step. These snapshots are then fed into the ANN model to obtain a prediction of the concentration. Since we cannot combine the two models in a continuous way, our solution of taking a series of snapshots and feeding them to the ANN mimics a continuous process. The GPD simulates the dispersal from a point source emission according to the emission rate:

C(x, y, z, H) = ( er_{i,t} / (2π U σ_y σ_z) ) · exp(−y² / (2σ_y²)) · [ exp(−(z − H)² / (2σ_z²)) + exp(−(z + H)² / (2σ_z²)) ] · D. (9)

This means that the concentration of the pollutant at the point (x, y, z) is calculated from: er_{i,t}, the emission rate in kilograms per hour of the source i at time step t; U, the wind speed in metres per second at time step t; σ_y and σ_z, the standard deviations of the concentration distributions in the crosswind and vertical directions (these two parameters are chosen according to stability class 'C' of the Guifford-Pasquill scale [25]); and H, the height of the source above the ground. The decay term D is given in [26] and computed according to (10):

D = exp(−x R / u), if R > 0; D = 1, if R = 0, (10)

where x is the downwind distance, u is the wind speed and R is the decay coefficient. The values of R are adopted from [27] for NOx (0.45 h⁻¹) and SOx (0.31 h⁻¹); PM10 is not considered (R = 0 and D = 1). For simplification, and due to the lack of wind direction data, we assume that the wind direction does not change during the simulation step.
The resulting pollution level from each source is aggregated and the average for each box is computed. The dispersion value of the pollutant is then passed to an ANN prediction model as described in [28].
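A minimal sketch of the plume computation of (9)–(10) as reconstructed above follows. The dispersion coefficients σ_y and σ_z would normally be derived from the Guifford–Pasquill stability-class curves; here they are passed in directly, the numbers in the usage comment are purely illustrative, and unit handling (the paper gives emission rates in kg/h and wind speeds in m/s) is left to the caller.

import math

def plume_concentration(x, y, z, er, u, sigma_y, sigma_z, H, R=0.0):
    """Gaussian plume concentration (eq. (9)) with the exponential decay
    term D of eq. (10). er: emission rate, u: wind speed, H: source height,
    R: decay coefficient (R = 0 disables decay). sigma_y / sigma_z are the
    dispersion coefficients for the chosen stability class, supplied here
    rather than derived from the Guifford-Pasquill curves."""
    D = math.exp(-R * x / u) if R > 0 else 1.0
    lateral = math.exp(-y ** 2 / (2 * sigma_y ** 2))
    vertical = (math.exp(-(z - H) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + H) ** 2 / (2 * sigma_z ** 2)))
    return (er / (2 * math.pi * u * sigma_y * sigma_z)) * lateral * vertical * D

# Illustrative call, 1 km downwind of an elevated source at ground level:
# plume_concentration(x=1000.0, y=0.0, z=0.0, er=0.5, u=2.4,
#                     sigma_y=100.0, sigma_z=60.0, H=50.0, R=0.31)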
The ANN prediction models are designed to give a forecast of the five air pollutants and of the air quality. This includes the uncertainty caused by the weather conditions [29]. The ANN predictor uses the aggregated air pollution concentration value given by the dispersion model for each source and the four climatic parameters: wind speed, humidity, temperature and rainfall. These parameters greatly influence the pollutant concentrations in the air [30]. O3 is a secondary pollutant, which means that it is not emitted by sources but results from the photochemical interaction between SOx, COx and organic components. Therefore, we used the SOx and COx dispersion information to predict the O3 concentration.
For each pollutant an RBF (Radial Basis Function) network is designed and trained. The RBF network is composed of three layers: the first layer is connected to the input of the network and its output is connected to the hidden layer; the neurons in the hidden layer have the RBF as their activation function; and the outputs of the hidden layer are linearly combined to obtain the output of the network. Using a training data set, the objective is to find the optimal combination of the number of neurons in the hidden layer and the weight of each input. By increasing the number of neurons in the hidden layer, the algorithm of [31] gives the optimal topology of the network. This is why many topologies are tested and only the best of them are kept. During the training step, each network receives as input a vector of the climatic condition parameters and the concentration of the pollutant at time t. Each network generates the desired output, which is the value of the pollutant concentration at time t + PredictionHorizon.
The forecast given by each network is passed as input to predict the air quality index using an MLP (Multi-Layered Perceptron). The MLP network is trained using the local air quality standards shown in Table 1. Air quality predictions for the different pollutants are obtained on a t + 12 hours basis and give the most probable air quality category. The MLP model receives the predicted values of the five pollutants, COx, NOx, O3, PM10 and SOx, and predicts the index value for air quality, ranging from 1 to 5, with 1 being very good and 5 being very bad. To train the MLP network we used the Levenberg-Marquardt back-propagation algorithm. The final MLP topology, obtained after several trials, is: 5 neurons in the first hidden layer, 10 in the second, and a linear neuron for the output layer.
The accuracy of the ANN models is given in Table 2 and is calculated using one year's worth of data according to the RMSE (Rooted Mean Squared Error):

RMSE = sqrt( Σ_{i=1}^{lng} (P_i − R_i)² / lng ), (11)

where lng is the length of the vectors and P and R are the predicted and measured values, respectively. The performance is computed using a validation data set that was not used in the training of the ANN models.

Model   Topology [# input neurons - # hidden neurons]   Validation error (RMSE)
PM10    [10-320]                                        16.1945 µg/m³
SOx     [10-90]                                         3.1618 µg/m³
NOx     [10-105]                                        9.7277 µg/m³
COx     [10-45]                                         0.1220 µg/m³
O3      [10-180]                                        39.8238 µg/m³
Table 2: Validation error of the ANN prediction models using the validation data set.
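The validation errors in Table 2 are computed with the RMSE of (11); a direct transcription, with our own function name:

import numpy as np

def rmse(predicted, measured):
    """Rooted Mean Squared Error of eq. (11) between the predicted values P
    and the measured values R over a validation set."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.sqrt(np.mean((predicted - measured) ** 2)))

# Example usage: rmse([52.0, 31.5, 18.2], [48.0, 30.0, 25.0])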
2.3 Decision-making mechanism
Based on its internal state and the state of the environment, an agent has to choose an action to perform, among all possible actions, in order to reach its goals. This process is called decision-making. [32] presents a review of methods used for modelling decision-making in coupled environmental and social systems. Our system supports two cooperation strategies (centralized and evolutionary game), each of which defines a decision-making mechanism. The centralized strategy (CS) is based on defining a central agent that represents the air pollution control agency; the central agent takes decisions according to the current air pollution level. The second strategy is based on an evolutionary game, where agents are rewarded and penalized according to the pollution levels and make decisions according to their rewards. In our system, the cooperation strategy is defined within the simulation parameters.

2.3.1 Centralized Strategy (CS)
The task of maintaining the air quality is assigned to an agent which represents the air pollution control agency. It uses the GPD and ANN models to predict the air quality and pollutant levels and, according to the predictions, sends a "reduce emission" message to the emission agents. It then checks the air quality, and continues doing this until the air quality improves enough to reach the air quality index goal. The central agent has absolute authority and its orders are executed by the emission source controllers. Agents communicate their emission rates to the central agent at each simulation step. This strategy is based on communication between agents. We assume that agents are rational and have an environmentally responsible personality; this means that they favour air quality improvement over their own interests and communicate their exact emission rates to the central agent.

2.3.2 Evolutionary Game Cooperating Strategy
In the EG strategy, every agent has its own goals (earning more rewards and keeping its emission rate high) and shares with the other agents the global goal of maintaining air quality. The appreciation function, defined as app: Q → R, allows comparing the air quality states at each step of the simulation. A global goal GG can be defined as (12); it consists in finding a set of actions A_t to be performed by the agents at time t which allows the system to move to a new air quality state q_{t+1} that is better than the current state:

GG = { T(ES_t, A_t, q_{t+1}), app(q_{t+1}) > app(q_t) }. (12)

An agent participates with the other agents in the NPPD game; its own goal is to maximise the reward it earns from the game. We adopted the approach of [33], where agents keep traces of their L previous steps (actions, rewards and their neighbours' rewards). To update the probabilities of increasing or decreasing the emission, we used the method of [18]. At each time step t the agent computes its weighted payoff according to (13) and tries to maximise it (as its utility function) by taking it into consideration when computing its probabilities to decrease or increase its emission rate, according to (14) and (15), respectively:

Mw_i(t) = Σ_{j=1}^{L} w_j · M_i(j), (13)

where the w_j are weighting parameters such that Σ_{j=1}^{L} w_j = 1 and ∀ j, k (j < k → w_j > w_k), and M_i(t) is the payoff of agent i at time step t;

Pc_i(t+1) = Pc_i(t) + (1 − Pc_i(t)) · λ_i(t), if s = 0 and Mw_i(t) > 0,
Pc_i(t+1) = (1 − λ_i(t)) · Pc_i(t), if s = 0 and Mw_i(t) ≤ 0, (14)

Pd_i(t+1) = Pd_i(t) + (1 − Pd_i(t)) · λ_i(t), if s = 1 and Mw_i(t) > 0,
Pd_i(t+1) = (1 − λ_i(t)) · Pd_i(t), if s = 1 and Mw_i(t) ≤ 0, (15)

where Pc_i and Pd_i are, respectively, the probabilities of decreasing (s = 0) and increasing (s = 1) emissions for agent i, λ_i(t) is the learning rate of agent i at time step t, and s is the strategy played at time t. The learning rate is updated according to (17), using the actions-homogeneity indicator D_i defined in (16):

D_i = Σ_{j=1}^{L−1} δ_j, with δ_j = 0 if X_{i,j} = X_{i,j+1} and δ_j = 1 if X_{i,j} ≠ X_{i,j+1}, (16)

where X_{i,j} is the j-th stored action of agent i. D_i compares the last L actions of the agent and is used to keep the agent from changing its actions: in (17), the learning rate λ_i(t+1) is increased by 0.015 when the last L actions are identical (D_i = 0), increased by 0.010 when the stored actions remain mostly homogeneous, and decreased by 0.010 otherwise.
Agents are influenced by their neighbours; at each time step, the average reward of the neighbours is calculated according to (18):

nP_i(t) = ( Σ_{j=1}^{numberOfNeighbours_i} M_j(t) ) / numberOfNeighbours_i, (18)

where M_j(t) is the payoff of the neighbour j and numberOfNeighbours_i is the number of neighbours of the i-th agent. We keep a trace of nP over the L previous simulation steps and compute its average, avgnP. The agent then uses the probabilities Pc, Pd and the average reward of its neighbours to choose an action according to (19): an agent that decreased its emission at step t (s = 0) switches to increasing (s = 1) with probability Pd_i(t+1) when avgnP_i(t+1) exceeds its own weighted payoff Mw_i(t+1), and keeps decreasing otherwise; symmetrically, an agent that increased its emission (s = 1) switches to decreasing with probability Pc_i(t+1) under the same condition.
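The following sketch shows one way the per-agent update could be organised, with class and attribute names of our own choosing. Equations (13)–(15) and (18)–(19) are followed as reconstructed above; the concrete weighting scheme in weighted_payoff and the switching rule in choose_action are assumptions consistent with the prose (recent payoffs weighted more heavily, switching only when the neighbours do better on average), not a transcription of the authors' code.

import random

class EmissionAgent:
    """Sketch of the evolutionary-game update of eqs. (13)-(19).
    s = 0 means 'cooperate' (decrease emission), s = 1 means 'defect'."""
    def __init__(self, L=4, lam=0.05):
        self.L = L                     # memory length
        self.lam = lam                 # learning rate lambda_i
        self.pc, self.pd = 0.5, 0.5    # probabilities to decrease / increase
        self.payoffs = []              # last L payoffs M_i
        self.s = random.randint(0, 1)

    def weighted_payoff(self):
        # Eq. (13): weights sum to one; here more recent payoffs get larger weights.
        k = len(self.payoffs)
        if k == 0:
            return 0.0
        w = [2 * (j + 1) / (k * (k + 1)) for j in range(k)]
        return sum(wj * m for wj, m in zip(w, self.payoffs))

    def update_probabilities(self, payoff):
        # Eqs. (14)-(15): reinforce the action just played if it paid off.
        self.payoffs = (self.payoffs + [payoff])[-self.L:]
        good = self.weighted_payoff() > 0
        if self.s == 0:
            self.pc = self.pc + (1 - self.pc) * self.lam if good else (1 - self.lam) * self.pc
        else:
            self.pd = self.pd + (1 - self.pd) * self.lam if good else (1 - self.lam) * self.pd

    def choose_action(self, avg_neighbour_reward):
        # Eq. (19), following the prose: switch only when the neighbours do
        # better on average than the agent's own weighted payoff.
        if avg_neighbour_reward > self.weighted_payoff():
            if self.s == 0 and random.random() < self.pd:
                self.s = 1
            elif self.s == 1 and random.random() < self.pc:
                self.s = 0
        return self.s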
At each simulation step, every agent receives a reward or a penalty according to its actions and to the pollution level. We have adopted the payoff curve (1) with b = 2 and c = −0.5, but in the general case these parameters can be defined by the user. When the pollution level is higher than the maximum allowed value, the participation of agent i in the current pollution level, pa_i(t), is computed according to (20):

pa_i(t) = ER_i(t) / (PL_e(t) − PL_max), if PL_e(t) > PL_max, (20)

where ER_i(t) is the emission rate of the i-th agent at time t, PL_e(t) is the pollution level of the pollutant e at time t, and PL_max is the maximum allowed value of the pollutant level according to the regulations and local standards. The penalty for agent i at time step t is calculated according to (21):

Penalty_i(t) = Penalty_i(t − 1) + (1 − 1 / pa_i(t)). (21)

Two penalising strategies were used. The first uses (21) and is a cumulative penalty; this means that the penalties from each step are kept and the agent is penalised as long as it continues to increase its emission. The second penalising method is not cumulative, and agents are penalised only according to the current simulation step. The reward of agent i at the current time step t is computed according to (22): we compute the number of agents who chose to decrease their emission, denoted ncp, after which we compute u as defined in equation (1):

M_i(t) = u(ncp), if s = 0; M_i(t) = u(ncp) − Penalty_i(t), if s = 1. (22)

3 Simulation scenarios using data from the region of Annaba
Annaba is a very industrialized region specialising in the steel industry. The steel complex of Hadjar is located 12 kilometres south of the city of Annaba, and its air pollution spreads over a radius of 6 km. According to [34], the complex annually releases into the atmosphere 36890 tons of particles, 845 t of NOx, 30895 t of COx, 2260 t of SOx and 3093 t of NOx. The petrochemical station (ASMIDAL) produces fertilizers and pesticide products that have a big influence on air quality. Five industrial zones, containing hundreds of factories, are very close to the urban area and have a large impact on air pollution. The seaport is located in the centre of the city and attracts a lot of heavy transport, which also leads to a deterioration in air quality.
The local pollution agency network provided hourly data for a two-year period, from 01/01/2003 to 31/12/2004. The concentrations of the air pollutants that have been continuously monitored are: Ozone (O3), Particulate Matter (PM10), Nitrogen Oxides (NOx) and Sulphur Oxides (SOx). The dataset also includes the meteorological parameters Wind Speed (WS), Temperature (T) and relative Humidity (H); daily rainfall measurements (RF) were also provided by the water management agency. The 2003 dataset was used for training the ANN and the 2004 dataset was used for validation; this helped us to assess the performance of the model. The pollutant concentration measurements are in µg/m³ and have been normalised using equation (23):

V'_p = V_p / (max(V_p) − min(V_p)), (23)

where V_p is a parameter vector and min and max are functions that return the minimum and maximum values of the vector. Negative values, resulting from faulty measurements, were replaced by the mean of the previous and next values.
It is impossible to discard faulty values, since gaps in the time series would result in a data shift that affects the ANN training process, leading to poor generalisation properties. Similarly, faulty (blank) measurements of the pollutants and weather parameters were replaced by an average of the u − q previous and u + q future values, with u being the faulty sample and q the number of values taken into consideration. This ensures the continuity and consistency of the time series and allows efficient training of the ANN predictors. Table 3 presents the statistical properties of the available data for the different pollutants and weather parameters; for some parameters data are not available (N/A).
We defined a simulation scenario for the Annaba region using the parameters in Table 4. The goal levels for the pollutant concentrations were fixed according to the air quality index goal; for this scenario we aimed to reach a very good air quality level (goal air quality index = 1). The initial values (at t = 0) of the pollutant concentrations and climatic parameters were fixed according to the dataset.

Parameter           2003 mean   2004 mean   2003 STD   2004 STD   Max value
PM10 (µg/m³)        51.70       27.76       51.66      26.38      508
NOx (µg/m³)         14.50       N/A         25.01      N/A        435.0
SOx (µg/m³)         7.60        N/A         14.78      N/A        190.0
CO (µg/m³)          1.31        N/A         0.52       N/A        12.2
O3 (µg/m³)          N/A         42.27       N/A        64.58      688.0
Wind Speed (m/s)    2.65        2.12        1.78       1.27       9.6
Humidity (%)        63.52       71.92       16.50      14.33      93.0
Temperature (°C)    18.96       16.82       7.76       6.30       42.1
Rainfall (mm)       N/A         2.96        N/A        9.27       73.9
Table 3: Statistical properties of the used dataset.

For the EG strategies we fixed the initial proportion of cooperating agents (agents choosing to decrease emissions) to 0.5; this means that 50% of the agents decrease their emission at t = 0. The value of this parameter was chosen following the work of [14] and [15]. The proportion changes during the simulation according to the game outcome. The prediction was for the next 2 hours, the same as the simulation step. Each source emits according to its emission rate, which cannot be higher than the maximum level defined in the simulation scenario. The positions of the sources were randomly generated, and many sources are located in the same box.

Parameter Name                              Value
Polluting activities and policy parameters
Number of PM10 sources                      100
Number of SOx sources                       100
Number of NOx sources                       100
Number of CO sources                        100
Max emission rate                           2000 gram/hour
Goal PM10 level                             20 µg/m³
Goal SOx level                              30 µg/m³
Goal NOx level                              45 µg/m³
Goal O3 level                               45 µg/m³
Number of memory steps (L)                  4 steps
Initial proportion of cooperating agents    0.5
Environment parameters
Number of boxes                             20
Temperature at t=0                          12.7 °C
Humidity at t=0                             71.0 %
Wind speed at t=0                           2.4 m/s
PM10 at t=0                                 13.0 µg/m³
SOx at t=0                                  17.0 µg/m³
NOx at t=0                                  2.0 µg/m³
CO at t=0                                   0.5 µg/m³
O3 at t=0                                   29.0 µg/m³
Air quality at t=0                          2 (Good)
Total simulation time                       4900 hours
K (simulation step)                         1 step = 2 hours
Prediction horizon                          Next 2 hours
Table 4: Parameter values of the simulation scenario.

4 Results and discussion
We have built a simulator using the approach described above. We used the JADE agent framework [35] and the ANN models from Encog [36]. We have defined 5 strategies: EG-CP (Evolutionary Game with Cumulative Penalty), EG-NCP (Evolutionary Game with No Cumulative Penalties), EG-NP (Evolutionary Game with No Penalty), CS (Centralized Strategy) and NC (No Cooperation). The last one is used for comparison purposes. Using the parameters shown in Table 4, we chose a strategy and ran the simulation 16 times.
We then changed the strategy and ran the simulation again 16 times; since we have 5 strategies we obtain 80 Informatica 42 (2018) 199–209 205 simulations. The most explicative results are presented. For the CS and NC cases the simulator showed similar results for each run. For the EG strategies there were small differences between runs, especially concerning the proportion of cooperating agents. These changes are due to the random values used in the initialisation of some variables (neighbours rewards, first chosen action, weights, k last actions and rewards). The comparison is done according to the air quality index. Results are expressed in terms of the number of occurrences of air quality index as illustrated in figure 2, for example the number of times the air quality index equals 1 (very good). Figure 3, shows the evolution of the air quality index over time. The CS gives the best performance. With the CS the air quality index moves rapidly from bad to average and then to good and finally stabilises at very-good (which is the goal fixed in the simulation scenario). The EG-CP moves the index from bad to average, when the equilibrium is reached it stabilises in good and never reaches a very-good index. The EG-NCP strategy moves the air quality from bad to average and never improves. When penalties are not used (EG-NP) the air quality stabilises at bad. When cooperation is not used (NC), agents act selfishly and do not care about the pollution, therefore, the air quality oscillates between bad and very-bad. As the agents reach their maximum emission rate we can observe an oscillation, which is caused by the climatic conditions. The only thing that affects the pollutant concentration is the climatic conditions (the emission rate is constant); these have a big influence and are captured with the ANN model. Figure 2: Air quality index using 5 different cooperation strategies. Figure 3: Air Quality index for 4900 hours. Figure 4: Concentration of PM10 for the four tested cooperation strategies compared with the no-cooperation strategy. Figure 5 : Concentration of SOX (a), NOx (b) and O3 (c) using the four strategies and the no-cooperation scenario. Figure 4 shows the evolution of the PM10 concentration during the simulation time. The PM10 concentration shows many peaks compared with the other pollutants under the same climatic conditions. This is due to the dry nature of the weather in the Annaba area, with wildfires, and sandstorms coming from the great Sahara desert. These events have a big effect on the PM10 concentration but not on the other pollutants. The CS strategy takes less time to control the pollution level and keep it below the goal level defined in the simulation parameters. All of the EG strategies take longer, keeping it close to the goal level, but without ever reaching it. The penalising regulations have a big effect on the PM10 level. As illustrated, the EG-CP (cumulative penalising method) controls the pollution better than the non-cumulative one, and both methods perform better than the no-penalising strategy. The no-cooperation is presented in order to show the impact of cooperation on the PM10 level. Figure 5 shows the evolution of the SOx, NOx and Ozone concentrations during the simulation time using four different cooperation strategies. The CS strategy gives the best performance since the pollution concentration rapidly decreases. The EG strategies show the same performance as for PM10 and the pollution level is widely influenced by the selected penalising method. 
The CP strategy appears to be the best one followed by the NCP. The pollution slowly decreases, but not enough to reach the goal level if penalisation is not used. Figure 8: The proportion of cooperating agents for the EG-NCP strategy. Figure 9: The proportions of cooperating agents, for the 4 groups of agents, when penalisation is not used. The NC strategy gives the worst levels; when all agents are emitting pollution using their maximum emission rate, the pollutant level reaches alarming values and peak periods occur. Figure 6 shows how the penalising method affects the proportion of cooperating agents. For the case of EG­CP the game equilibrium is reached at time step 387 and stabilises when the proportion of cooperating agents is between 0.93 and 0.95. The EG-NCP strategy stabilises early at time step 196 and oscillates between a cooperation ratio of 0.73 and 0.80, after which the equilibrium is fixed at 0.77. The EG-NP is the slowest; the equilibrium is reached at time step 808 with a cooperation ratio of 0.57. This happens because the agents are not penalised since the strategy does not include penalising methods. Figures 7, 8 and 9 show, respectively, the proportion of cooperating agents according to the pollutant for the three penalising strategies EG-CP, EG-NCP and EG-NP. The PM10 agents gives the highest cooperation ratio. This is because many peaks occur with this pollutant and the others pollutants cooperating ratio (NOx and SOx) are influenced by the O3 concentration. The more the pollutant exceeds the allowed level, the greater the proportion of co-operators. The equilibriums are disturbed by the pollution level, because, when the pollution has not yet reached the goal level, agents agents start being penalised, and thus they tend to cooperate more. Informatica 42 (2018) 199–209 207 5 Conclusions Anthropogenic activities are among the main causes of pollution and environmental problems. These activities have to be included in the simulation models. Modelling the interaction between social and ecological components is a very important aspect. A MAS approach allows us to model the social network of human-beings sharing the exploitation of common environmental resources. Manipulating the behaviour at an individual and group level helps to gain more knowledge about the impact of human decision-making on pollution and makes the simulation more realistic. Studies treating air pollution are usually concerned with the physical aspects (concentration and dispersion of pollutant), and do not include human-decision factors on the emission sources.. In our approach, we model the decision-making activity of the air pollution emission source managers. This helps to investigate the conditions and regulations that may enhance and maintain the air quality. We used a two stage air pollution modelling method: a GPD dispersion model and an ANN forecasting model. The ANN predictor uses climatic parameters and dispersal information provided by the GPD model to make predictions. This helped to introduce the effect of uncertainty caused by the weather and made the simulation more realistic. Five cooperation strategies were tested. The centralized cooperation strategy (CS) showed the best performance, surpassing the reward/penalty strategies. However, the CS strategy needs an effective communication network between emission sources controllers and the regulation agency. Also, we assume that emission controllers communicate exactly their emission rate, which is not always the case. 
The reward/penalty strategies seem to be more realistic; penalising the polluting agents according to their participation during peak periods has a big influence on their behaviours. As shown in the simulation results, it helps reducing the pollution level and affects the evolution of the pollutant. Thus, air pollution regulations have a big impact on pushing the emission source controllers to take their polluting activity seriously; this is especially important during the peak periods where climatic conditions cause the pollutants to stagnate. To summarise our study helps to: (1) Model and introduce human decision-making concerning emission sources and the process of simulating air pollution evolution. (2) Evaluate the possible cooperation between the actors concerned in managing the air quality. (3) Have a prediction about the efficiency of regulation rules for preserving the air quality. (4) Investigate the impact on air quality of the decision to expand or establish a new emission points. Our work aims to provide a decision-making tool to the air pollution control agencies that will help them evaluate the regulations and policies concerning air pollution control. The current version of the system deals only with point emission sources. In future versions we aim to include line and area sources. Line sources model the road activities, whereas area sources model the waste management and agricultural sources of pollution. If data becomes available in future it could be interesting to experiment with different time representations. Fortunately the multi-agent system approach allows us to easily change to a different scale of time representation in the same simulation. We can envisage using one time representation for decisions and another for monitoring. The first can help us to see long term impacts (e.g. investing in less polluting activities), and the second can help to see the short term changes. The simulator may also be enhanced by including topographic aspects of regions since this has a big influence on the dispersion of air pollutants. In addition, including more agent personalities and exploring other cooperation strategies are also among our future plans. Our system is designed in a generic way and it could be adapted for other types of pollution such as water pollution. This could be done by changing the current dispersion and the prediction models to a water pollution dispersion model. 6 References [1] Folke, Carl. Resilience: The emergence of a perspective for social–ecological systems analyses. Global environmental change, 16.3, 2006. pp. 253-267. [2] WHO (World Health Organisation), Ecosystems and Human Well-being: Health Synthesis . WHO Library Cataloguing-in-Publication Data, 2005. [3] Kryza, M., Netzel, P., Drzeniecka-Osiadacz, A., Werner, M., & Dore, A. J. (2014). Forecasting the Ozone Concentrations with WRF and Artificial Neural Network Based System. In Air Pollution Modeling and its Application XXII Springer Netherlands, 2014, pp. 605-609. [4] Holmes, NS and Morawska, L., A Review of Dispersion Modelling and its application to the dispersion of particles: An overview of different dispersion models available. Atmospheric Environment, 40(30), 2006. pp. 5902-5928. [5] Lushi, Enkeleida, and John M. Stockie. An inverse Gaussian plume approach for estimating atmospheric pollutant emissions from multiple point sources. Atmospheric Environment, 44.8, 2010. pp. 1097-1107. [6] Aulinas Montse, Turon Claudia and Sanchez-Marre Miquel. 
Agents as a Decision Support Tool in Environmental Processes:The State of the Art. Whitestein Series in Software Agent Technologies and Autonomic Computing. 2009. pp. 5–35. [7] Ghazi, Sabri, Khadir, Tarek, Julie Dugdale. Multi-Agent Based Simulation of Environmental Pollution Issues: A Review. Highlights of Practical Applications of Heterogeneous Multi-Agent Systems. The PAAMS Collection. S. Ghazi et al. Springer International Publishing, 2014. pp. 13­21. [8] Hlsmann, F., Gerike, R., Kickhfer, B., Nagel, K., & Luz, R. . Towards a multi-agent based modeling approach for air pollutants in urban regions. In Proceedings of the Conference on “Luftqualität an Straßen, 2011. pp. 144-166. [9] Kickhfer, B., Hlsmann, F., Gerike, R., & Nagel, K. Rising car user costs: comparing aggregated and geo-spatial impacts on travel demand and air pollutant emissions. Smart Transport Networks: Decision Making, Sustainability and Market structure, NECTAR Series on Transportation and Communications Networks Research, 2012, pp.180-207. [10] Kickhfer, B., & Nagel, K. Towards high-resolution first-best air pollution tolls. Networks and Spatial Economics, 2013. pp. 1-24. [11] Papaleonidas Andonis, Iliadis Lazaros. Hybrid and Reinforcement Multi Agent Technology for Real Time Air Pollution Monitoring. Artificial Intelligence Applications and Innovations IFIP Advances in Information and Communication Technology Volume 381, 2012, pp. 274-284. [12] Ahat M., Ben Amor S., Bui M., Lamure M., Marie-Françoise Courel. Pollution Modeling and Simulation with Multi-Agent and Pretopology. Complex Sciences, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering Volume 4, 2009, pp.225-231. [13] Boyd, Robert, Richerson, Peter J. The evolution of reciprocity in sizable groups. Journal of Theoretical Biology, vol. 132, no 3, 1988, pp. 337-356. [14] Szilagyi Miklos N., An Investigation of N-person Prisoners’ Dilemmas. Complex Systems, 14, 2003. pp. 155–174. [15] Szilagyi Miklos N., Somogyi Iren, The Shape of Payoff Functions in N-person Games, Journal of Game Theory. 1(3), 2012,10-14 doi: 10.5923/j.jgt.20120103.02. [16] Xianyu, Bo, and Ping Chen. Does Social Welfare Preference Always Promote Cooperation on Barabási and Albert Networks ?. Computational Economics, 37.3, 2011, pp. 249-266. [17] Barabási, Albert-Lászl, and Réka Albert. Emergence of scaling in random networks. science 286.5439, 1999: 509-512. [18] Conrad Power, A Spatial Agent-Based Model of N-Person Prisoner's Dilemma Cooperation in a Socio-Geographic Community, Journal of Artificial Societies and Social Simulation vol. 12, no. 18, 2009, http://jasss.soc.surrey.ac.uk/ 12/1/8.html. [19] Hardin, Garrett. The tragedy of the commons. Science, 162, 1968. pp. 1243-1248. [20] Osang, Thomas, and Arundhati Nandy. Environmental regulation of polluting firms: Porter’s hypothesis revisited. Brazilian Journal of Business Economics. 3.3, 2003. pp. 128-148. [21] Lange Andreas, , Vogt Carsten, , Cooperation in international environmental negotiations due to a preference for equity, Journal of Public Economics, Volume 87, Issues 9–10, September 2003, pp. 2049–2067. [22] Touza, J., Drechsler, M., Smart, J. C., & Termansen, M. Emergence of cooperative behaviours in the management of mobile ecological resources. Environmental Modelling & Software, 45, 2013. pp. 52-63. [23] De Vries, HJM Bert. Environmental Modelling. In : Principles of Environmental Sciences. Springer Netherlands, 2009. pp. 345-373. [24] Stockie, John M. 
The mathematics of atmospheric dispersion modeling., Siam Review, 53.2, 2011. pp. 349-372. [25] Hanna, Steven R., Gary A. Briggs, and Rayford P. Hosker Jr. Handbook on atmospheric diffusion. No. DOE/TIC-11223. National Oceanic and Atmospheric Administration, Oak Ridge, TN (USA). Atmospheric Turbulence and Diffusion Lab., 1982. [26] User's guide for the industrial source complex (ISC3) dispersion models volume II -description of model algorithms, U.S. Environmental Protection Agency, 1995, http://www.epa.gov/scram001/userg/regmod/isc3 v2.pdf (visited on 11/22/2014). [27] Behera, Sailesh N. , Sharma, Mukesh. Degradation of SO2, NO2 and NH3 leading to formation of secondary inorganic aerosols: An environmental chamber study. Atmospheric environment, vol. 45, no 24, 2011 pp. 4015­4024. [28] Ghazi, Sabri,, Khadir Med Tarek. Combination of artificial neural network models for air quality predictions for the region of Annaba, Algeria. International Journal of Environmental Studies, 69.1, 2012.pp. 79-89. [29] Pelliccioni, Armando et Tirabassi, Tiziano. Air dispersion model and neural network: A new perspective for integrated models in the simulation of complex situations. Environmental modelling & software, vol. 21, no 4, (2006). pp. 539-546. [30] Khedairia, S., Mohamed Khadir T.. Impact of clustered meteorological parameters on air pollutants concentrations in the region of Annaba, Algeria, Atmospheric Research 113, 2012. pp. 89-101. [31] Chen, S., Cowan, C.F., and Grant, P. M., Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks, IEEE Transactions on Neural Networks, 2(2), 1991, 302-309. [32] Li An, Modeling human decisions in coupled human and natural systems: Review of agent-based models, Ecological Modelling 229, 2012, 25– 36. [33] Zhao-Han Sheng., Hou, Y. Z., Wang, X. L., & Du, J. G. ,The evolution of cooperation with Informatica 42 (2018) 199–209 209 memory, learning and dynamic preferential selection in spatial prisoner's dilemma game. In Journal of Physics: Conference Series, IOP Publishing, Vol. 96, No. 1, 2008, p. 012107. [34] Tadjine, A., Djebar, H., & Courtois, A. Toxicité des poussieres rejetées par le complexe sidérurgique d’Annaba sur quelques parametres hématologiques du lapin Europeus. Environnement, Risques & Santé, 7(3), 2008, 209-215. [35] Bellifemine, Fabio, POGGI, Agostino, et RIMASSA, Giovanni. JADE–A FIPA-compliant agent framework. In : Proceedings of PAAM. 1999. p. 33. [36] Heaton, J. ,Encog: Library of Interchangeable Machine Learning Models for Java and C#. Journal of Machine Learning Research, 16, 2015., 1243-1247. Fast Artificial Bee Colony for Clustering Abba Suganda Girsang Computer Science Department, BINUS Graduate Program -Master of Computer Science Bina Nusantara University, Jakarta, Indonesia, 11480 E-mail: agirsang@binus.edu Yohan Muliono and Fanny Fanny Computer Science Department, School of Computer Science, Bina Nusantara University Jakarta, Indonesia, 11480 E-mail: ymuliono@binus.edu, fanny.sa@binus.edu Keywords: artificial bee colony, clustering, fast ABC, redundant process Received: February 14, 2017 Artificial Bee Colony (ABC) is one of good heuristic intelligent algorithm to solve optimization problem including clustering. Generally, the heuristic algorithm will take the high computation time to solve optimization problem. Likewise, ABC also consumes too much time to solve clustering problem. This paper intends solving clustering problem using ABC with focusing reduction computation time called FABCC. 
The idea is to detect patterns of redundant processing and then compact them into an effective process, so as to diminish the computation. Five data sets are used to demonstrate the performance of FABCC. The results show that FABCC is effective, pruning the process duration by up to 46.58%. Povzetek: Predstavljena je izboljšava v algoritmu Artificial Bee Colony za gručenje, ki dosega na merjenih domenah skoraj 50% pohitritev.
1 Introduction
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and in many disciplines [1]. From a machine learning perspective, Yang and Kamel state that clustering is unsupervised learning, because no category labels denoting an a priori partition of the objects are used [2]. Generally, there are two types of data to be clustered: metric (numerical) data and non-metric (non-numerical) data [3]. The clustering problem has broad appeal as one of the steps in exploratory data analysis. Jain et al. describe some important applications of clustering algorithms, such as image segmentation, object recognition, and information retrieval [1]. Recently, population-based optimization inspired by the behaviour of animal swarms has often been used to solve optimization problems such as the travelling salesman problem [4]–[6] and clustering [7]–[11]. The artificial bee colony (ABC) algorithm is one of the most recently introduced swarm-based algorithms. ABC simulates the intelligent foraging behaviour of a honey bee swarm [8] to obtain the optimal solution. ABC has been shown to be a good algorithm for clustering [8]; the proposed method clusters the Cancer-Int, Iris and Wine data sets with perfect accuracy. It also shows good performance compared with ten algorithms (PSO, BayesNet, MlpANN, RBF, KStar, Bagging, MultiBoost, NBTree, Ridor and VFI). However, compared with other evolutionary algorithms, ABC has a challenging problem: its convergence speed is slower than that of other representative population-based algorithms [12]. Also, like evolutionary algorithms in general, ABC performs many repeated computations before converging to a solution. Researchers commonly concentrate only on their algorithm finding the best solution and neglect the time needed to run it. This research is therefore conducted with a focus on the computation time, while trying not to stray far from the average best solution. Some researchers have attempted to solve this problem. Girsang et al. proposed BCOPR, inspired by bee colony optimization, to obtain a fast process for solving the travelling salesman problem [13]. Lu et al. also addressed fast clustering [14]. This research develops a fast algorithm based on the ABC algorithm to solve the clustering problem. Besides the recent success of ABC in clustering data, as stated in [4], [7], [8], this research focuses on the artificial bee colony because ABC also has many slots that can be modified to obtain a fast algorithm. Karaboga applied a greedy selection in ABC [8], while Zhang [7] used Deb's algorithm [15] instead of the greedy selection because Zhang considers Deb's algorithm much simpler. The remainder of this paper is organized as follows: Section 2 gives a brief introduction to the clustering problem and the artificial bee colony. Section 3 provides a detailed description of the FABCC algorithm, while the performance evaluation of the proposed algorithm is presented in Section 4.
Finally, conclusions are offered in Section 5.
2 Related work
2.1 K-means clustering
Clustering algorithms are generally classified into two big groups: hierarchical clustering and partitional clustering [16]. Hierarchical clustering groups data objects with a sequence of partitions. Hierarchical procedures are divided into two categories, agglomerative and divisive. The agglomerative approach begins with each pattern in a distinct (singleton) cluster, and clusters are merged later; the divisive approach is the reverse, beginning with a single cluster that is divided later [1]. Partitional clustering algorithms obtain a single partition of the data instead of a clustering structure. This technique usually produces clusters by optimizing a criterion function defined either locally or globally. This research uses partitional clustering, because the number of clusters is already known: we start with the number of clusters and neither reduce nor increase it. Among the various clustering formulations that are based on minimizing a formal objective function, the most widely used and studied is k-means clustering [17]. Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians, in which the objective is to minimize the distance to the nearest centre (centroid). The aim of the k-means algorithm is to divide M points in N dimensions into K clusters so that the within-cluster sum of squares is minimized. It is not practical to require that the solution has the minimal sum of squares over all partitions, except when M and N are small and K = 2. We instead seek "local" optima, i.e. solutions such that no movement of a point from one cluster to another will reduce the within-cluster sum of squares [18]. This research uses an algorithm similar to k-means, slightly modified by adding a greedy algorithm to swap between candidate cluster solutions.
2.2 Artificial Bee Colony
The Artificial Bee Colony (ABC) algorithm is a swarm intelligence algorithm that copies the mechanism of a honey bee swarm's intelligence in finding food sources [19]. Originally, ABC optimization was proposed for solving numerical problems [20]. Therefore, the first studies aimed to evaluate the performance of ABC on the widely used set of numerical benchmark test functions and to compare it with well-known evolutionary algorithms such as the Genetic Algorithm, Particle Swarm Optimization and Ant Colony Optimization [19]. There are three types of bees. The first is the employed bees, which search for a food source. The food source value depends on many factors, such as its proximity to the nest, the richness or concentration of its energy, and the ease of extracting this energy [21]. An employed bee then performs a waggle dance; the richer the food source, the longer the waggle dance lasts. The waggle dance represents the fitness value. Second are the onlooker bees, which wait for the employed bees to perform the waggle dance and then choose among them randomly, with probability proportional to the fitness value of each employed bee. The last are the scout bees, which look for food sources without any pattern. The position of a food source represents a solution, and the amount of nectar represents the quality of the solution found by the bees.
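In ABC-based clustering, a natural way to score a food source (a set of K candidate centroids) is the within-cluster sum of squares described above. The snippet below is a minimal illustration of that objective, not the exact fitness formula used in [7] or [8]; the function and variable names are ours.

import numpy as np

def within_cluster_sum_of_squares(data, centroids):
    # Sum of squared distances from each point to its nearest centroid.
    # data: (M, N) array of points; centroids: (K, N) array of candidate centres.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)  # (M, K)
    return float(np.sum(np.min(dists, axis=1) ** 2))

# A lower objective value corresponds to a better (fitter) food source.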
In [19], Karaboga proved that ABC can be used to optimize multivariable functions and that ABC outperforms other swarm intelligence algorithms such as the Genetic Algorithm, the Particle Swarm Algorithm and the Particle Swarm Inspired Evolutionary Algorithm (PS-EA) [19]. The main steps of the artificial bee colony algorithm are:
Step 1: Initialize the population of solutions randomly and evaluate them.
Step 2: Produce new solutions for each employed bee, evaluate them, and apply the greedy selection process.
Step 3: Calculate the probabilities of the current sources (employed bees) with which they are preferred by the onlookers.
Step 4: Assign onlooker bees to employed bees according to these probabilities.
Step 5: Produce new solutions for each onlooker bee, evaluate them, and apply the greedy selection process.
Step 6: Stop the exploitation of the sources abandoned by the bees and send the scouts into the search area to discover new food sources randomly.
Step 7: Memorize the best food source found so far.
Step 8: If the termination condition is not yet met, repeat the process from Step 2; otherwise stop the algorithm.
Building on Karaboga's artificial bee colony algorithm, Zhang contributed some additional steps so that it can be used to solve the clustering problem. Zhang appends a control parameter to the algorithm and also has some steps that differ from Karaboga's algorithm. Zhang adds a control parameter, called the upper bound, to the scout phase; it is used as the limit on the number of scouts, and in his algorithm the scout phase differs from the employed bees phase. In Karaboga's algorithm, a scout finds a random food source only once and then acts like an employed bee. In Zhang's algorithm, however, a scout keeps acting as a scout, always finding new food sources randomly, as long as it remains a scout. Scouts only change back into employed bees when the limit on the number of scouts is reached; the worst bee then remains a scout, and the others become employed bees. The steps of Zhang's artificial bee colony algorithm for clustering are:
Step 1: Initialize the population of solutions and its control parameter. Order the colony so that the first half consists of the employed bees and the second half of the onlooker bees. Generate a random position for each employed bee and evaluate it. Set the scout number to zero.
Step 2: If the number of scouts is greater than its upper bound, order the first half of the colony, make the bees with the worst solution quality scouts and the others employed bees. Update the scout number.
Step 3: Produce new solutions for each employed bee, evaluate them, and apply Deb's selection process. If the limit for abandonment is reached, the employed bee forgets its memory and becomes a scout; the scout number is increased by one.
Step 4: Send each scout into the search area to discover new food sources randomly. When a new food source is found, evaluate it and apply Deb's selection process.
Figure 1: Greedy Selection Process.
Figure 2: Comparison of the ABC algorithm between (a) Karaboga's ABC, (b) Zhang's ABC for clustering and (c) the proposed method.
Step 5: Calculate the probability value of the current food sources with which employed bees are preferred by the onlookers.
Step 6: Produce new solutions for each onlooker bee, evaluate them, and apply Deb's selection process to update the corresponding employed bee's memory or the current food sources.
Step 7: For each employed bee and scout, if its memorized position is better than the best position achieved so far, the best position is replaced by it. If the termination condition is not yet met, repeat the process from Step 2; otherwise stop the algorithm.
3 Proposed method
3.1 The concept
This section first describes the combination of the Karaboga [8] and Zhang [7] algorithms. Most of the steps follow Zhang's algorithm, except that the selection process uses Karaboga's greedy algorithm. The greedy algorithm is the key to finding the best solution. For the next experiment, the centroid (food source) calculation is therefore reconstructed after the greedy step, in which each employed bee compares its food source with another bee's food source, as shown in Fig. 1. Besides the greedy algorithm, the proposed method, FABCC, also combines Zhang's and Karaboga's algorithms in the scout phase. Unlike Karaboga's algorithm, a bee in FABCC mimics Zhang's algorithm in that the processing sequence of the bees is employed bee, scout bee, and then onlooker bee, as shown in Fig. 2. However, FABCC adopts Karaboga's calculation to determine the fitness value.
3.2 ABCC and FABCC
As the literature review in Section 2 mentions, there are three kinds of bee in this algorithm: employed bees, onlooker bees, and scout bees, each with its own fitness value.
Figure 3: ABC for clustering algorithm flowchart.
Figure 4: Comparison of the employed bee / onlooker bee phase between (a) the ABCC phase and (b) the FABCC phase.
The process of searching for the best fitness value consists of eight steps, and the flowchart is shown in Fig. 3.
Step 1: Initialize the population of solutions and its control parameter. Order the colony so that the first half consists of the employed bees and the second half of the onlooker bees. Generate a random position for each employed bee and evaluate it. Set the scout number to zero.
Step 2: If the number of scouts is greater than its upper bound, order the first half of the colony, make the bees with the worst solution quality scouts and the others employed bees. Update the scout number.
Step 3: Produce new solutions for each employed bee, evaluate them, and apply the greedy selection process.
Step 4: If the limit for abandonment is reached, the employed bee forgets its memory and becomes a scout in order to explore the search space and obtain a new food source randomly; the scout then acts like an employed bee. The scout number is increased by one.
Step 5: Calculate the probability value of the current food sources with which employed bees are preferred by the onlookers.
Step 6: Produce new solutions for each onlooker bee, evaluate them, and apply the greedy selection process.
Step 7: Compare employed bees and onlooker bees that have the same food source and save the best quality for each food source.
Step 8: Memorize the best overall food source quality. If the termination condition is not yet met, repeat the process from Step 2; otherwise stop the algorithm.
In experiments using the ABCC algorithm on large data sets that need many iterations, ABCC takes too much computation time. This section therefore explains the proposed fast artificial bee colony method, which reduces the computation time. In the artificial bee colony algorithm, the employed bee and onlooker bee phases include an evaluation step: an employed or onlooker bee and its random neighbour are compared by their fitness values, regardless of whether the compared data are similar or not (Fig. 4.a).
In the FABCC algorithm, however, one step is added to check the compared data. If the data of an employed or onlooker bee and its neighbour are similar, the calculation of the fitness value is skipped and the bee's data remain the same as the current data (Fig. 4.b).
4 Experimental results
The parameters used in fast ABC for clustering (FABCC) are shown in Table 1. The description of those parameters is as follows.
a. The number of bees is 20, grouped into the 3 types of bee.
Figure 5: Duration of the ABCC and FABCC processes.
b. The number of initial employed bees and the number of onlookers are both 10.
c. The limit on searching for a better solution means that if an employed bee fails to obtain a better solution 100 times consecutively, it leaves its pattern and becomes a scout bee to search for another pattern.
d. The selected maximum cycle number is 2000.
Table 1: Parameters used for the experiment.
Parameter               Value
Total bees              20
Employed bees           10
Scouts                  up to 5
Onlooker bees           10
Limit for abandonment   100
Maximum cycle number    2000
This research uses several data sets for the evaluation, as shown in Table 2. The evaluation focuses on two aspects: quality and processing time. The quality of the program can be evaluated from the resulting fitness value. Fig. 5 shows the computation times of the original algorithm for clustering (ABCC) and the modified algorithm for clustering (FABCC) on five data sets. For every data set, the figures show that the difference between ABCC and FABCC starts around the 300th iteration; this means there is no significant difference in computation time over roughly the first 200 iterations.
Table 2: Data sets.
Data Set                     Number of Patterns   Number of Clusters   Number of Attributes
Iris                         150                  3                    4
Wine                         178                  3                    13
Haberman                     306                  2                    3
Connectionist Bench (Sonar)  208                  2                    60
Parkinson                    195                  2                    22
The times can diverge because, at the beginning, the bees still generate various patterns to be learned. After some iterations, the preferred pattern is created; the bees learn from previous patterns. A pattern that matches previous patterns many times is identified as a repeated process and can therefore be marked as a pruned pattern, preventing redundant computation. In more detail, Fig. 6 shows that the computation time per iteration in FABCC tends to decrease after several iterations, while the computation time in ABCC stays constant from the first iteration.
Figure 6: Computation time per iteration for ABCC and FABCC.
In FABCC, the computation time can decrease because the bees learn the patterns in the data; when the algorithm generates known patterns, a bee truncates some of the calculation, so the computation time decreases after many iterations. The results for the parameters of Table 1 are shown in Fig. 7 and Table 3. Fig. 7 shows the convergence of ABCC and FABCC; both start to converge after about 800 iterations. Table 3 shows the results of the proposed algorithm, FABCC, compared with the original algorithm, ABCC. The results are divided into two parts: computation time and quality (fitness value). The evaluations were run 30 times for each part. Each part reports test statistics such as the mean, min, max, and standard deviation (SD). The min is considered the best solution, while the max is considered the worst solution. All standard deviations are very small compared with the means (less than 1%), which indicates that the results are almost identical to the expected values. This study also conducted the Wilcoxon signed-rank test.
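As a minimal illustration of such a paired comparison (the test itself is described next), the per-run times of the two algorithms on one data set could be compared as follows; the arrays shown are hypothetical placeholders for the 30 recorded runs, not the paper's measurements.

from scipy import stats

# Hypothetical paired runtimes (seconds) of repeated runs on the same data set.
abcc_times = [53.9, 53.8, 54.0, 53.9, 53.7]   # ... 30 values in practice
fabcc_times = [37.6, 37.5, 37.8, 37.4, 37.7]  # ... 30 values in practice

statistic, p_value = stats.wilcoxon(abcc_times, fabcc_times)
# A p-value below the chosen significance bound (here 0.01) indicates that
# the two paired samples differ significantly.
print(statistic, p_value)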
The Wilcoxon test is used to analyse whether the results of paired observations from two samples (here the results of ABCC and FABCC) differ or not. The significance bound (α) is set to be less than 0.01. If Asymp. Sig. < α, the two related samples (FABCC and ABCC) differ significantly. Table 3 shows that the difference between the fitness values of ABCC and FABCC is not significant. In the best case, FABCC achieves the same results as ABCC on all data sets; ABCC outperforms FABCC only slightly in the mean and worst values, with differences of less than 1%. However, the computation time of FABCC decreases significantly compared with ABCC, by about 30–50%. This means FABCC can be applied to reduce the computation time, which is a known problem of ABC as a heuristic algorithm.
Table 3: Results of the experiment (Min = best, Max = worst; Asymp. Sig. (2-tailed) for ABCC vs. FABCC).
Computation time
Data Set    ABCC: Min   Max      Mean     SD      FABCC: Min   Max      Mean     SD      Asymp. Sig.
Iris        53.76       54.01    53.94    0.001   37.08        38.21    37.62    0.001   0.002
Wine        74.57       75.23    74.80    0.002   52.18        56.17    54.56    0.002   0.003
Haberman    100.76      101.2    100.91   0.002   46.42        59.22    53.91    0.012   0.005
Sonar       159.99      160.35   160.11   0.001   111.78       126.83   118.15   0.011   0.009
Parkinson   93.76       95.64    94.05    0.002   48.54        63.95    56.05    0.009   0.007
Fitness value
Iris        78.94       78.94    78.94    0.000   78.94        79.11    79.02    0.001   0.001
Wine        2370700     2370700  2370700  0.000   2370700      2370700  2370700  0.000   0.000
Haberman    30507       30507    30507    0.000   30507        30524    30513.4  0.004   0.002
Sonar       280.53      280.61   280.57   0.001   280.53       280.71   280.62   0.002   0.000
Parkinson   1343400     1343400  1343400  0.000   1343400      1343981  1343710  0.008   0.001
Figure 7: Fitness value of ABCC and FABCC.
5 Conclusion
This research uses a bee colony algorithm that combines Zhang's and Karaboga's algorithms: the sequence of the ABC algorithm follows Zhang, and the fitness value calculation follows the original Karaboga. The proposed method, FABCC, also shows that redundant processes occur in the bee colony algorithm for clustering; a redundant process is identified as a pattern that can be compressed. The results show that FABCC is effective in reducing the computation time. This was demonstrated on five data sets: Iris, Wine, Haberman, Sonar, and Parkinson. The results show that it can reduce the computation time by 30–50%, while the fitness value degrades by less than 1%. This study focuses on only five small data sets for clustering; it can be extended to other, larger data sets, which might have different characteristics. In our exploration, however, redundant processes occur in most metaheuristic algorithms. In future research, effort can be put into reworking the fitness value calculation, since it involves many iterations and redundant calculations that could be observed and pruned.
6 References
[1] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31, no. 3, pp.
264–323, 1999. [2] Y. Yang and M. Kamel, “Clustering ensemble using swarm intelligence,” Swarm Intell. Symp. 2003. SIS ’03. Proc. 2003 IEEE, pp. 65–71, 2003. [3] T.Zhang,R.Ramakrishnan,andM.Livny,“BIRCH: An Efficient Data Clustering Method for Very Large Databases,” Proc. 1996 ACM SIGMOD Int. Conf. Manag. Data, vol. 1, pp. 103–114, 1996. [4] B. Akay and D. Karaboga, “A modified Artificial Bee Colony algorithm for real-parameter optimization,” Inf. Sci. (Ny)., vol. 192, pp. 120–142, 2012. [5] A. Ouaarab, B. Ahiod, and X.-S. Yang, “Discrete cuckoo search algorithm for the travelling salesman problem,” Neural Comput. Appl., vol. 24, no. 7–8, pp. 1659–1669, 2013. [6] S.-M. Chen and C.-Y. Chien, “Solving the traveling salesman problem based on the genetic simulated annealing ant colony system with particle swarm optimizationtechniques,” Expert Syst. Appl., vol. 38, no. 12, pp. 14439–14450, 2011. [7] C. Zhang, D.Ouyang, andJ. Ning, “Anartificial bee colony approach for clustering,” Expert Syst. Appl., vol. 37, no. 7, pp. 4761–4767, 2010. [8] D. Karaboga and C. Ozturk, “A novel clustering approach: Artificial Bee Colony (ABC) algorithm,” Appl. Soft Comput., vol. 11, no. 1, pp. 652–657, 2011. [9] S. Goel, A. Sharma, and P. Bedi, “Cuckoo Search Clustering Algorithm: A novel strategy of biomimicry,” 2011 World Congr. Inf. Commun. Technol., pp. 916–921, 2011. Informatica 42 (2018) 211–219 219 [10] S. Rana, S. Jasola, and R. Kumar, “A review on particle swarm optimization algorithms and their applications to data clustering,” Artif. Intell. Rev., vol. 35, no. 3, pp. 211–222, 2011. [11] C.-L. Huang, W.-C. Huang, H.-Y. Chang, Y.-C. Yeh, and C.-Y. Tsai, “Hybridization strategies for continuous ant colony optimization and particle swarm optimization applied to data clustering,” Appl. Soft Comput., vol. 13, no. 9, pp. 3864–3872, 2013. [12] R. Storn and K. Price, “Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces,” J. Glob. Optim., pp. 341– 359, 1997. [13] A. S. Girsang, C.-W. Tsai, and C.-S. Yang, “A Fast Bee Colony Optimization for Traveling Salesman Problem,” 2012 Third Int. Conf. Innov. Bio-Inspired Comput. Appl., vol. 1, no. c, pp. 7–12, 2012. [14] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown, “FGKA: A fast genetic k-means clustering algorithm,” Proc. 2004 ACM …, pp. 1–2, 2004. [15] K. D. David E. Goldberg, “A comparative analysis of selection schemes used ingenetic algorithms.” [16] J. HanandM.Kamber, “DataMining: Concepts and Techniques,” Ann. Phys. (N. Y)., vol. 54, p. 770, 2006. [17] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and a. Y. Wu, “Anefficient k-means clustering algorithm: analysis and implementation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 881–892, 2002. [18] J. a HartiganandM. a Wong, “Algorithm AS 136: A k-means clustering algorithm,” Appl. Stat., vol. 28, no. 1, pp. 100–108, 1979. [19] D. Karaboga and B. Basturk, “A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm,” J. Glob. Optim., vol. 39, no. 3, pp. 459– 471, 2007. [20] D. Karaboga, B. Gorkemli, C. Ozturk, and N. Karaboga, “A comprehensive survey: artificial bee colony (ABC) algorithm and applications,” Artif. Intell. Rev., vol. 42, no. 1, pp. 21–57, 2014. [21] D. Karaboga and B. Basturk, “On the performance ofartificial bee colony(ABC) algorithm,” Appl. Soft Comput., vol. 8, no. 1, pp. 687–697, 2008. 
A Pairing Free Secure Identity-based Aggregate Signature Scheme under Random Oracle
Eman Abouelkheir
Electrical Department, Faculty of Engineering, Kafr Elsheikh University, Egypt
E-mail: emanabouelkhair@eng.kfs.edu.eg
Jolanda G. Tromp
Department of Computer Science, State University of New York Oswego, USA
E-mail: jolanda.tromp@oswego.edu
Keywords: information security, aggregate signature, without pairings, security proof, random oracle
Received: November 15, 2016
Signature aggregation is efficient for communication links, as the time complexity is independent of the n different users. Bilinear pairings require super-singular elliptic curve groups with a large range of elements. Moreover, point multiplication over an elliptic curve has a lower computational cost than a pairing; pairing-based schemes therefore incur more computational complexity than schemes without pairings. This paper introduces a new efficient and secure pairing-free signature scheme based on the idea of aggregation. The proposed pairing-free scheme offers a lower computational cost than other, pairing-based schemes, saving 68.69% of the computations.
Povzetek: Ta prispevek se ukvarja s kriptografskimi algoritmi, konkretno s shemo digitalnih podpisov. Opisana je izboljšava obstoječega algoritma, ki dosega pohitritev za dve tretjini, hkrati ostaja varna.
1 Introduction
Cryptography has two primitive goals: confidentiality and authentication. A digital signature achieves the authentication goal. Also, for efficient communication links, schemes should provide low time complexity; moreover, low time complexity helps save battery power and channel bandwidth in networks [1]. There are many cryptographic algorithms that provide privacy, such as signature schemes, authentication schemes, and encryption schemes [2]. To provide privacy and anonymity to a user, these schemes have to be properly combined. Schemes and methods such as group signature schemes, blind signatures, aggregate signatures, zero-knowledge proof methods and homomorphic encryption schemes offer several useful privacy-enhancing properties, e.g. identity hiding, binding of information, data confidentiality, unlinkability, untraceability, etc. Recently, many applications and services have required privacy protection over communication systems. Current secure communication systems support authentication, data integrity, and non-repudiation, but the users and providers of communication systems may need security properties beyond these basic ones. Such advanced properties are usually connected with user privacy. The following text summarizes the advanced security properties and requirements.
• Privacy/Anonymity – privacy protection is ensured for every user in the system who follows the rules. Users can communicate anonymously; their identities can be revealed only in special cases, e.g. when a user breaks a rule, or on an authority order, a police order, emergency events, etc. Two types of privacy protection can be distinguished: basic anonymity, which protects a user identity against passive attacks, and full anonymity, which protects also against active attacks [3]. Unlinkability of signatures is needed when an attacker gets access to all old messages; when the unlinkability property is ensured, the attacker is not able to connect particular signatures together.
• Responsibility/revocation – every user has to be revealed and revoked, using the corresponding key, when breaking the rules of the system.
The revocation assures that the revoked user has no rights in the whole system afterwards. Revocation helps to protect the system against repeated misuse. In some applications, traceability of malicious users' messages is demanded.
• Efficient and secure key management – key exchange, revocation and establishment in systems have to be efficient (computationally and in memory) and secure. In privacy-preserving solutions, key management has to preserve user privacy.
• Efficient and secure execution of cryptographic protocols – the phases of a cryptographic protocol should be as efficient as possible to minimize their negative influence on the system, especially if restricted devices are deployed.
• Exculpability – neither the revocation manager nor the key manager, who hold the tracing keys, should be able to generate a valid signature on behalf of another user; a user cannot be accused of making a signature that he did not make. This property is mainly needed in group signature schemes; exculpability is ensured, for example, in the scheme of Boneh, Boyen and Shacham [4].
The idea of aggregate signatures is that the signatures of different signers on different documents are aggregated into one concise aggregate signature [5]. Using an aggregate signature instead of n signatures by n different users in applications such as key distribution in a PKI reduces the communication overhead and offers an efficient computational cost. As another example, in secure routing systems each router needs to sign its part of a route in the communication link and then transmits all the signatures to the following router; without the aggregation concept, transmitting the individual signatures imposes a high communication overhead [6]. An aggregate signature can be used instead of individual signatures for this purpose. Two notable aggregate signature schemes have been proposed. The first [5] provides flexible aggregation based on pairings; the second [7] provides only sequential aggregation using certified trapdoor permutations. In both [5] and [7], the authors proposed aggregate signature schemes whose size is independent of the number of users n. Individual signers must be specified by some public information needed for public verification. Aggregate signature schemes that specify the signers by their public keys are similar to traditional signatures, and both are inefficient; thus, specifying the signers by their identities is more useful than specifying them by their public keys.
Pairing-based cryptography has many attractive properties; with smaller parameters, it is supposed to offer the same desired security level as general elliptic curve cryptography. Suppose an elliptic curve E has its elements defined over the prime field F_q. Pairing-based cryptography, however, works with functions and elements defined over the extension field F_(q^k), where the embedding degree k is chosen so that the construction is secure. The basic problems on which the security of pairing-based cryptography depends are the elliptic curve discrete logarithm problem (ECDLP) defined over F_q and the discrete logarithm problem (DLP) defined over F_(q^k) [8]. Because of this, the present paper introduces a new pairing-free aggregate signature scheme based on the general elliptic curve cryptosystem and on the signers' identities. The idea behind identity-based cryptography (IBC) [9] is to use some information belonging to a signer (such as an e-mail ID) as the user's public key, rather than relying on public-key and certificate management.
Therefore, an IBC system requires a Private Key Generator (PKG), a trusted third party that generates the private keys for all identities based on its master key and the signer's identity. Boneh and Franklin [10] and Cocks [11] proposed identity-based encryption schemes; earlier schemes appear in [9, 12, 13], while more recent schemes and analyses include [14, 15, 16, 17]. The concept of signature aggregation allows different signers to sign different messages, which leads to efficient communication and fewer computations. In an aggregate signature scheme, n different signatures are considered as one single signature. The aggregation approach can be used instead of public key certificates to achieve efficient communication and computation. Aggregate signatures have many applications, such as mutual authentication between vehicles in VANETs and in wireless sensor networks. The goal of this paper is to introduce a new secure pairing-free aggregate signature scheme. The security of the proposed scheme is proven in the random oracle model, assuming a hard Diffie-Hellman problem. Also, the proposed scheme reduces the time complexity by 68.69%. The new aggregate signature and its analysis are a modified version of the scheme in [18]. The rest of the paper is organized as follows: Section 2 presents the digital signature approach versus watermarking. Section 3 describes the preliminaries. Section 4 introduces the generic model for the proposed scheme. Section 5 presents the security requirements of any aggregate signature based on users' identities. Sections 6 and 7 present the proposed scheme and its formal security proof under the random oracle model, respectively. Section 8 introduces the results and discussion. The proposed scheme is compared with others in the literature in Section 9. Section 10 concludes the proposed work. Finally, the future scope is introduced in Section 11.
2 Digital signing versus watermarking
Digital signing is a cryptographic approach used to secure communication links. The goal of the signature is to authenticate the users of an end-to-end communication system. The digital signing operation is analogous to the handwritten signing operation on paper: it is used to verify the identity of a user through a digital certificate. This paper is concerned with the digital signature approach (a minimal sketch of ordinary signing and verification is given at the end of this section). The goal of watermarking is to hide information in a digital signal in order to provide copyright protection in a digital format [19]. Many watermarking schemes have been proposed. In 2012, Nilanjan Dey, Poulami Das, Sheila Sinha Chaudhuri, and Achintya Das [20] used Alattar's method efficiently for watermark insertion and extraction for an EEG signal. In 2013, K. P. Arijit, D. Nilanjan, S. Sourav, D. Achintya, and S. Ch. Sheli [21] proposed a new reversible watermarking technique used for the colour image. In 2014, Nilanjan Dey, Goutam Dey, Sheli Sinha Chaudhuri, and Sayan Chakraborty [22] proposed two novel blind watermarking mechanisms for the electromyogram (EMG) signal: (1) a session-key-based blind watermarking mechanism and (2) a self-recovery-based blind watermarking mechanism. In 2015, Nilanjan Dey, Monalisa Dey, Sainik Kumar Mahanta, and Achintya Das [23] proposed a technique to prevent any modification of a transmitted biomedical ECG signal. In 2016, Y. B. Amar, I. Trabelsi, D. Nilanjan and S. Bouhlel [24] proposed a watermarking scheme used for copyright protection purposes.
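For readers unfamiliar with the signing interface discussed above, the sketch below shows ordinary (non-aggregate, certificate-based) ECDSA signing and verification using the Python cryptography package; it only illustrates the generic sign/verify workflow and is unrelated to the identity-based aggregate scheme proposed later. The message content is a hypothetical example.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.exceptions import InvalidSignature

private_key = ec.generate_private_key(ec.SECP256R1())  # signer's key pair
public_key = private_key.public_key()

message = b"route segment announced by router R1"
signature = private_key.sign(message, ec.ECDSA(hashes.SHA256()))

try:
    public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
    print("signature valid")
except InvalidSignature:
    print("signature invalid")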
3 Preliminaries
3.1 Bilinear pairing
Suppose G1 is a cyclic additive group of prime order q, generated by a point P on an elliptic curve E defined over a prime field, and let G2 be a cyclic multiplicative group of the same order q. Let e: G1 x G1 -> G2 be a pairing. For any points P, Q on the elliptic curve E and any integers a, b, the pairing satisfies the following properties:
- Bilinearity: e(aP, bQ) = e(P, Q)^(ab).
- Non-degeneracy: e(P, P) != 1.
- Computability: e must be easy and efficient to compute.
3.2 Elliptic Curve Cryptography (ECC)
ECC is a public-key cryptography approach based on the mathematics of elliptic curves. ECC is faster than RSA and uses smaller keys, while still providing the same level of security. Let E(F_p) be the set of points of an elliptic curve defined over the prime field F_p by the equation y^2 = x^3 + ax + b (mod p), where the coefficients a, b in F_p satisfy 4a^3 + 27b^2 != 0 (mod p). These points, together with the point at infinity O, form a cyclic additive group G; the group identity element of G is O, the infinity point. Scalar multiplication on G is defined as kP = P + P + ... + P (k times) for an integer k, and a point P of order n satisfies nP = O. ECC was proposed in 1985 by Miller [25] and Koblitz [26]. When comparing ECC with other public-key cryptosystems, it was found that ECC-based public-key cryptosystems have many advantages, such as low computational cost, smaller key size, low storage cost, etc. It is known that the discrete logarithm problem based on ECC (ECDLP), i.e. recovering the scalar that relates an elliptic curve element to a known public base point, is harder than the discrete logarithm problem (DLP) over a finite field. Security is not the only objective of cryptography; there are also many other factors, such as the problems associated with key management and protection, hash functions, defective use of random generators, and insecure private-key software. The main ECC implementation issues are [27]:
• ECC is used in the Diffie-Hellman cryptosystem and also in digital signing approaches.
• There are many standardized elliptic curves approved by NIST for the various security requirements.
• Elliptic curve cryptosystems come with comprehensive information on the algorithms.
The benefits of elliptic curve based cryptosystems over the RSA cryptosystem are:
• An elliptic curve based cryptosystem key takes significantly less memory for the same security level; Table 1 indicates the key sizes of RSA and ECC for the same security level.
• The smaller key size in ECC leads to faster digital signature generation and therefore saves resources.
On the other hand, ECC has disadvantages compared with the RSA cryptosystem: its mathematical background is more complicated.
NIST guidelines for the key sizes (in bits) of ECC, RSA, and AES:
ECC    RSA      Ratio   AES
163    1024     1:6     -
256    3072     1:12    128
384    7680     1:20    192
512    15360    1:30    256
Table 1: Security level of various key sizes in ECC and RSA.
3.3 Computational problems
Here is a brief review of some mathematical problems.
Definition 3.3.1. Let g be a generator of a group G of prime order. The computational Diffie-Hellman (CDH) problem related to g is to compute g^(ab) given g^a and g^b for unknown integers a, b.
Definition 3.3.2 (Computational Diffie-Hellman (CDH) problem over an elliptic curve). Given a point P on an elliptic curve and the points aP and bP for unknown a and b, it is hard to compute abP over the group G.
Definition 3.3.3 (Computational Diffie-Hellman (CDH) assumption). Let A be an adversary trying to break the CDH problem: given a CDH problem tuple (P, aP, bP), A can solve the CDH problem only with a negligible (trivial) advantage.
4 Aggregate signature model
An identity-based aggregate signature scheme model is composed of six algorithms:
Setup phase: with input k; the security parameter; the public key generator (PKG) generates the master and private keys and the system parameters . Finally, the publishes , and keeps secret. . Key Extract: PKG runs this algorithm using the signer identity ;delivered by the signer , and as an input. The output is the signer private key and the PKG sends the signer private key via secure channel to the user . . Sign: this algorithm takes the user identity , his private key , message and as input to create a valid signature on by the signer . . Aggregate: this algorithm takes as an input, any third party can generate the signature aggregation for all the messages with their identities . . Signature Verification: with input the user performs two checking operation first; whether by is a valid signature on and outputs “Valid” if true otherwise , “Invalid”. Second; with input and checks the validity of the aggregate signature on and outputs “Valid” if true otherwise , “Invalid”. Security algorithm 5.1 Unforgeability The proposed scheme security model follows the scheme proposed by [18] with slight variations. The security model follows a game with three phases: setup, training and forgery phase. Two attacks in this security model are considered; adaptive chosen message and identity attacks. Thus, the scheme is secure under those attacks against any forgery. if the adversary A has not a significant advantage in any probabilistic time algorithm in this game : -Setup: by executing this algorithm the challenger obtains the parameters and the and deliver the to the adversary . -Training: A query the following oracle after the setup algorithm: . Extract oracle: With makes a query and obtains the private key with and deliver it to . Signing oracle: A queries the signing oracle with , then generates a valid signature on . -Forgery: A generates an aggregate signature on for with input in which at least target identity . The adversary A forge the signature if there is a valid ........ for a pair with the advantage: E. Abouelkheir et al. 6 The proposed scheme 6.1 Setup . In this phase, the PKG selects three additive groups of order q ( prime number) where , k is the security parameter. Then the PKG selects two pairs of integers satisfying . Also, the PKG selects a generator point P of on the elliptic curve E defined by over the finite field and chooses a the following hash functions , , and . Then, the PKG randomly picks up , s is and calculates the . The PKG keeps secrete and the public. 6.2 Key extraction This algorithm follows the following steps: 1. Picks up randomly and calculates 2. Computes , for the all users ,. 3. The PKG sends the corresponding secrete key and the public key to the users through a secrete channel 6.3 Signing With input : 1. Selects a random number and calculates: 2. Computes : , and 3. Computes: , . The signature of on message is 6.4 Aggregate. On input a set of signatures , with the identity , , are the signatures of the messages : , , The aggregate signature will be 6.5 Signature verification. With the input from any user can verify this signature. The verification process as follow: . Computes to recover . Checks if the following equations holds: and 6.6 Proof of correctness The proposed scheme security proof The security proof demonstrates that ECDLP could be solved without significant probability . Also, An adversary A may forge this scheme without significant probability against chosen message and identity attacks 7.1 Theorem1. 
The signature scheme is secure against chosen message and identity attacks if there is an adversary A with a polynomially bounded query for , , , and who can forge the proposed scheme with a non-negligible advantage , C may forge the signature (1) Proof: a) Setup The challenger C selects a group with a generator point P. Then, C randomly selects .......* and calculate . C obtains the following four hash oracles: Informatica 42 (2018) 221–228 225 then deliver the public to A A asks C for different queries as follow: b) query o Firstly, C delivers the system parameters to A, then C with input , selects q randomly and returns it to A. o In another case, A might know the public component X that corresponds to an identity ID. When A makes a query for ID, there are two cases: . In the case of : , the challenger C suits , computes , is anonymous, C wants to solve the ECDLP for , as it is part of ECDLP. After this, C stores in . . If , C selects randomly, sets , delivers to the signer such that and stores c) Extract query When queries for the private key of , does the following o checks the to verify whether or not there is an entry for . If does not contain an entry for , return o Otherwise, if the entry corresponding to in is of the form and returns , if then .. recovers the tuple from and returns and compute then returns to . d) query When , is submitted to queries for the first time C returns checks of whether the tuples in C returns , otherwise C chooses a new random includes to the then C returns e) H2 queries When , is submitted to queries the first time C returns checks of whether the tuples in ,C returns , otherwise C chooses a new random includes to the then C returns f) Sign Oracle For each new query , C proceeds as follows: . If , C signs a message m as follows: . If the public key of has been replaced: 1) Obtains by calling query oracle on 2) Selects randomly, calculates: . 3) Computes: by calling query on input , and by calling on input , and Obtains the secrete key d from the extract query and computes: , . The signature of ID on message m is . Otherwise, C signs m in the usual manner by using (obtained from the query) and (obtained from extract query) . If , C does the following: 1) Generates a random 2) Sets 3) Computes: , and 4) Updates the lists and respectively with the following tuples and . Generate a different then repeat steps 3 and 4 if any entry in the list or is similar as the tuples generated. 5) C returns the signature on by . Note the generated signature is valid due to: This shows that will able to be a valid signature to the adversary A. 7.2 Forgery phase 7.2.1 Lemma 1 After the adversary A generate on the message by user identities . A can generate a valid with probability if there exists where . The algorithm could be flunk in the following places : -For the extract oracle if the adversary queries for the then the algorithm flunks. If is the maximum extract queries number made by the adversary. The probability of non-querying for the extract phase is: (2) where is the queries maximum number . A may success if or if the adversary A make a query for the signing oracle on with user identity . This happen if: (3) From the previous probabilities, A can break the scheme under adaptive chosen message and identity attack with the advantage: (4) The adversary A may generate a valid aggregate signature without signer secrete key with the probability (5). 7.2.2 Lemma 2 A made queries for , , , , query as the previous queries. 
A may generate a valid aggregate signature with probability for n users. C computes as same as the previous, and then generate a valid signature . Using two valid signatures C does the following: Thus C knows all the private keys multiplied the point P over the elliptic curve by . Also, C knows by multiplying the final equation by , but C cannot get unless solving the ECDLP and it is hard under the assumption (ECDLP). C (6) By this, the proposed identity based aggregate signature over is secure against any forgery with a non­ significant probability . Under this assumption C might solve the ECDLP 8 Results and discussion When analyzing time complexity of the proposed scheme, it is found that it consumes only two point multiplication over elliptic curve in an individual signing process. Through the verification process, the proposed scheme consumes two point multiplication, one modular inverse operation and two point addition over the elliptic curve. All the computations are relative to the modular multiplication process. The proposed scheme consumes 127.84 in one individual complete signing and verification process 9 Comparative study This section shows the comparative study between the proposed signature scheme without pairing with the scheme with pairings in [28] in the case of individual signing. The computations are all relative to the modular multiplication. Table II indicates the definitions for the cryptographic operations. Notation Description The time complexity needed to execute the modular multiplication The time complexity needed to execute elliptic curve scalar point multiplication, The time complexity needed to execute the pairings operation, The time complexity needed to execute pairing-based exponentiation, The time complexity needed to execute the point addition over elliptic curve, The time complexity needed to execute the modular inversion operation, Table 2: Definition of different cryptographic operations. The scheme in [28] uses the identity-based signature from pairings. Craig and Zul.kar scheme consumes 406.24 in an individual signing operation while, the proposed pairing free scheme consumes 127.84 in an individual operation and therefore the proposed scheme shows lower time complexity than in [28], as it saves 68.69% from the computations as in table III. 10 Conclusion This paper introduces a new aggregate signature scheme without pairings. It saves 68.69% of computational cost than another scheme in [28] in pairings. The security proof of the proposed scheme shows that it is secure in random oracle model. The aggregate signature schemes are very useful when needing authentication in vehicular ad hoc network and e-commerce applications. 11 Future scope The idea of the aggregate signature used in securing the communication networks such as vehicular area network VANETs and Mobile area networks MANETs. Also , aggregate signature used in the e-commerce applications. The proposed scheme should be used in VANETs to provide aggregate authentication with low computational cost. Signature Verification Tota l ( in ) Crai g, and Zul. kar 4 - - - 1 3 - 2 406. 24 IDB­ASC 2 - - - 2 - 1 2 127. 84 Table 3: Comparison of computational cost. 12 References [1] K. C. Barr and K. Asanovic. Energy Aware Lossless Data Compression. In Proceeding of Mobisya 2005 [2] A. Prace. Privacy Preserving Cryptographic Protocols For Secure Heterogeneous Networks. Thesis of Doctoral, 2014 [3] X. Boyen and B. Waters. 
Full Domain Subgroup Hiding and Constant Size Group Signatures In Public Key Cryptography. Lecture notes in computer science, vol. 4450, pp.1-15, 2007 [4] D. Boneh, X. Boyen and H. Shacham. Short Group Signatures. In Advances in Cryptology-CRYPTO 2004, Springer , pp. 227-242. [5] D. Boneh, C. Gentry, B. Lynn and H. Shacham. Aggregate and veri.ably encrypted signatures from bilinear maps. In Proceeding of Eurocrypt 2003, vol. 2656 of LNCS, pp. 416–432. [6] S. Kent, C. Lynn and K. Seo. Secure border gateway protocol (secure-bgp). IEEE Journal Selected Areas in Comm., pp.582–592, 2000. [7] A. Lysyanskaya, S. Micali, L. Reyzin and Shacham H. Sequential aggregate signatures from trapdoor permutations. In Proceeding of Eurocrypt 2004, vol. 9999 of LNCS, pp. 74–90. [8] Z. Cao and L. Liu. On the Disadvantages of Pairing-based Cryptography. International Journal of Network Security, vol.17, no.4, pp.454-462, July 2015. [9] A. Shamir. Identity-based Cryptosystems and Signature Schemes. In Proceeding of Crypto 1984, vol. 196 of LNCS, pp 47–53. [10] D. Boneh and M. Franklin. Identity-based Encryption from the Weil Pairing. SIAM Journal of Computing, vol.32, no.3, pp.586–615, 2003 [11] C. Cocks. An Identity Based Encryption Scheme Based on Quadratic Residues. In Proc. of IMA Int. Conf., vol. 2260 of LNCS, pp.360–363, 2001 [12] Fiat A. and Shamir A., “How to prove yourself: Practical solutions to identification and signature problems”, In Proceeding of Crypto 1986, vol. 263 of LNCS, pp. 186–194. [13] L. C. Guillou and J. J. Quisquater. A Paradoxical Identity-Based Signature Scheme Resulting from Zero-Knowledge. In Proceeding of Crypto 1988, vol. 403 of LNCS, pp 216–231. [14] J. C. Cha and J. H. Cheon. An Identity-Based Signature From Gap Di.e-Hellman Groups. In Proceeding of PKC 2003, vol. 2567 of LNCS, pp.18–30 . [15] X. Boyen. Multipurpose Identity-Based Signcryption (A Swiss Army Knife For Identity-based Cryptography). In Proceeding of Crypto 2003, vol. 2729 of LNCS, pp 383–399. [16] Libert B. and Quisquater J.-J., “Identity based undeniable signatures,”. In Proc. of CT-RSA 2004, pp. 112–125. [17] M. Bellare, CH. Namprempre and G. Neven. Security Proofs for Identity-Based Identi.cation And Signature Schemes. In Proceeding Proc. of Eurocrypt 2004, vol.3027 of LNCS, pp.268-286. [18] S. Sharmila, D. Selvi, S. S. Vivek, J. Shriram and C. P. Rangan. Identity Based Partial Aggregate Signature Scheme Without Pairing. IACR Cryptology eprint Archive , 2010. [19] M. S. Murty, D. Veeraiah and A. S. Rao. Digital Signature and Watermark Methods For Image Authentication using Cryptography Analysis. Signal & Image Processing : An International Journal (SIPIJ) Vol.2, No.2, June 2011. [20] N. Dey, P. Das, S. S. Chaudhuri , and A. Das. A Session Based Watermarking technique Within the NROI of Retinal Fundus Images for Authencation Using DWT Spread Spectrum and Harris Corner Detection. In the Proceeding of the Fifth International Conference on Security of Information and Networks, 2012. [21] N. Dey, A. K. Pal, S. Samanta, A. Das, and S. S. Chaudhuri. Optimisation of Scaling Factors in Electrocardiogram Signal Watermarking using Cuckoo Search. International Journal of Bio-Inspired Computation vol.5, no. 5, pp.315-326, 2013. [22] N. Dey, G. Dey, S. Chakraborty, and S. S. Chaudhuri. Feature Analysis of Blind Watermarked Electromyogram Signal in Wireless Telemonitoring. In the series Annals of Information Systems vol. 16 pp. 205-229, 2014. [23] N. Dey, M. Dey, S. K. Mahata and D. Ach. 
Tamper Detection of Electrocardiographic Signal using Watermarked Bio-hash Code in Wireless Cardiology. International Journal of Signal and Imaging Systems Engineering, vol. 8, no. 1-2, 2015.
[24] Y. B. Amar, I. Trabelsi, N. Dey, and S. Bouhlel. Euclidean Distance Distortion Based Robust and Blind Mesh Watermarking. International Journal of Interactive Multimedia and Artificial Intelligence, vol. 4, no. 2, pp. 46-51, 2016.
[25] V. S. Miller. Use of elliptic curves in cryptography. In Proceedings of CRYPTO 1985, LNCS, Springer-Verlag, vol. 218, pp. 417-426, 1985.
[26] N. Koblitz. Elliptic Curve Cryptosystem. Journal of Mathematics of Computation, vol. 48, pp. 203-209, 1987.
[27] K. Magons. Applications and Benefits of Elliptic Curve Cryptography. University of Latvia, Faculty of Computing, Raina bulvaris 19, Riga, LV-1586, Latvia.
[28] C. Gentry and Z. Ramzan. Identity-Based Aggregate Signatures. In Proceedings of the 9th International Conference on Theory and Practice in Public-Key Cryptography, New York, NY, USA, pp. 24-26, 2006.

Static and Incremental Overlapping Clustering Algorithms for Large Collections Processing in GPU

Lázaro Janier González-Soler, Airel Pérez-Suárez and Leonardo Chang
Advanced Technologies Application Center (CENATAV), 7ma A #21406, Playa, CP: 12200, Havana, Cuba
E-mail: jsoler@ceneatav.co.cu and http://www.cenatav.co.cu/index.php/profile/profile/userprofile/jsoler
E-mail: asuarez@cenatav.co.cu and http://www.cenatav.co.cu/index.php/profile/profile/userprofile/asuarez
E-mail: lchang@cenatav.co.cu and http://www.cenatav.co.cu/index.php/profile/profile/userprofile/lchang

Keywords: data mining, clustering, overlapping clustering, GPU computing

Received: November 18, 2016

Pattern Recognition and Data Mining pose several problems in which, by their inherent nature, it is considered that an object can belong to more than one class; that is, clusters can overlap each other. OClustR and DClustR are overlapping clustering algorithms that have shown, in the task of document clustering, the best tradeoff between quality of the clusters and efficiency among the existing overlapping clustering algorithms. Despite the good achievements attained by both aforementioned algorithms, they are O(n²), so they could be less useful in applications dealing with a large number of documents. Moreover, although DClustR can efficiently process changes in an already clustered collection, the amount of memory it uses could make it not suitable for applications dealing with very large document collections. In this paper, two GPU-based parallel algorithms, named CUDA-OClus and CUDA-DClus, are proposed in order to enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents. The experimental evaluation conducted over several standard document collections showed the correctness of both CUDA-OClus and CUDA-DClus, and also their better performance in terms of efficiency and memory consumption.

Povzetek: OClustR and DClustR are overlapping clustering algorithms that achieve good results, but their complexity is quadratic in the size of the collection. This paper presents two GPU-based parallel algorithms, CUDA-OClus and CUDA-DClus. In the experiments they demonstrated the ability to work with large amounts of data.

1 Introduction

Clustering is a technique of Machine Learning and Data Mining that has been widely used in several contexts [1].
This technique aims to structure a data set in clusters or classes such that objects belonging to the same class are more similar than objects belonging to different classes [2]. There are several problems that, by their inherent nature, consider that objects could belong to more than one class [3, 4, 5]; that is, clusters can overlap each other. Most of the clustering algorithms developed so far do not consider that clusters could share elements; however, the desire to adequately target those applications dealing with this problem has recently favored the development of overlapping clustering algorithms, i.e., algorithms that allow objects to belong to more than one cluster. An overlapping clustering algorithm that has shown, in the task of document clustering, the best tradeoff between quality of the clusters and efficiency, among the existing overlapping clustering algorithms, is OClustR [6]. Despite the good achievements attained by OClustR in the task of document clustering, it has two main limitations:

1. It has a computational complexity of O(n²), so it could be less useful in applications dealing with a large amount of documents.

2. It assumes that the entire collection is available before clustering. Thus, when this collection changes, it needs to rebuild the clusters starting from scratch; that is, OClustR does not use the previously built clustering for updating the clusters after changes.

In order to overcome the second limitation, the DClustR algorithm was proposed by Pérez-Suárez et al. in [7]. DClustR introduced a strategy for efficiently updating the clustering after multiple additions and/or deletions from the collection, making it suitable for handling overlapping clustering in applications where the collection changes frequently, specially for those applications handling multiple changes at the same time. Nevertheless, DClustR still suffers from the first limitation; that is, like OClustR, it is O(n²). This implies that when the collection grows a lot, the time that DClustR uses for processing the changes could make it less useful in real applications. Moreover, when the collection grows, the memory space used by DClustR for storing the data it needs will also grow, making DClustR a high memory consumer and, consequently, making it not suitable for applications dealing with large collections. Motivated by the above mentioned facts, in this work we extend both OClustR and DClustR for efficiently processing very large document collections.

A technique that has been widely used in recent years in order to speed up computing tasks is parallel computing and, specifically, GPU computing. A GPU is a device that was initially designed for processing algorithms belonging to the graphical world, but due to its low cost, its high level of parallelism and its optimized floating-point operations, it has been used in many real applications dealing with a large amount of data.

The main contribution of this paper is the proposal of two GPU-based parallel algorithms, namely CUDA-OClus and CUDA-DClus, which enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents, like for instance news analysis, information organization and profiles identification, among others. Preliminary results of this paper were published in [8].
The main differences of this paper with respect to the conference paper presented in [8] are the following: (1) we introduce a new GPU-based algorithm, named CUDA-DClus, which is a parallel version of the DClustR algorithm, that is able to efficiently process changes in an already clustered collection and to efficiently process large collections of documents, and (2) we introduce a strategy for incrementally building and updating the connected components present in a graph, allowing CUDA-DClus to minimize the memory needed for processing the whole collection. It is important to highlight that in CUDA-DClus we only analyze the additions of objects to the collection, because this is the case in which it could be difficult to apply DClustR in real applications dealing with large collections, since this is the case that makes the collection grow.

The remainder of this paper is organized as follows: in Section 2, a brief description of both the OClustR and DClustR algorithms is presented. In Section 3, the CUDA-OClus and CUDA-DClus parallel clustering algorithms are proposed. An experimental evaluation, showing the performance of both proposed algorithms on several document collections, is presented in Section 4. Finally, the conclusions as well as some ideas about future directions are presented in Section 5.

2 OClustR and DClustR algorithms

In this section, both the OClustR [6] and DClustR [7] algorithms are described. Since DClustR is the extension of OClustR for efficiently processing collections that can change due to additions, deletions and modifications, OClustR is first introduced and then the strategy used by DClustR for updating the clustering after changes is presented. All the definitions and examples presented in this section were taken from [6, 7].

2.1 The OClustR algorithm

In order to build a set of overlapping clusters from a collection of objects, OClustR employs a strategy comprised of three stages. In the first stage, the collection of objects is represented by OClustR as a weighted thresholded similarity graph. Afterwards, in the second stage, an initial set of clusters is built through a cover of the graph representing the collection, using a special kind of subgraph. Finally, in the third stage, the final set of overlapping clusters is obtained by improving the initial set of clusters. In the following, each stage is briefly described.

Let O = {o1, o2, ..., on} be a collection of objects, β ∈ [0, 1] a similarity threshold, and S : O × O → ℝ a symmetric similarity function. A weighted thresholded similarity graph, denoted as G̃β = ⟨V, Ẽβ, S⟩, is an undirected and weighted graph such that V = O and there is an edge (v, u) ∈ Ẽβ iff S(v, u) ≥ β; each edge (v, u) ∈ Ẽβ, v ≠ u, is labeled with the value of S(v, u). As it can be inferred, in the first stage OClustR must compute the similarity between each pair of objects; thus, the computational complexity of this stage is O(n²).

Once G̃β is built, in the second stage OClustR builds an initial set of clusters through a covering of G̃β, using weighted star-shaped sub-graphs. Let G̃β = ⟨V, Ẽβ, S⟩ be a weighted thresholded similarity graph. A weighted star-shaped sub-graph (ws-graph) in G̃β, denoted by G* = ⟨V*, E*, S⟩, is a sub-graph of G̃β having a vertex c ∈ V*, called the center of G*, such that there is an edge between c and all the other vertices in V* \ {c}; these vertices are called satellites. All vertices in G̃β having no adjacent vertices (i.e., isolated vertices) are considered degenerated ws-graphs.

For building a covering of G̃β using ws-graphs, OClustR must build a set W = {G*_1, G*_2, ..., G*_k} of ws-graphs of G̃β, such that V = ⋃_{i=1}^{k} V*_i, being V*_i, ∀i = 1...k, the set of vertices of the ws-graph G*_i. For solving this problem, OClustR searches for a list C = {c1, c2, ..., ck}, such that ci ∈
C is the center of G*_i ∈ W, ∀i = 1..k. In the following, we will say that a vertex v is covered if it belongs to C or if it is adjacent to a vertex that belongs to C. For pruning the search space and for establishing a criterion in order to select the vertices that should be included in C, the concept of relevance of a vertex is introduced. The relevance of a vertex v, denoted as v.relevance, is defined as the average between the relative density and the relative compactness of a vertex v, denoted as v.densityR and v.compactnessR, respectively, which are defined as follows:

v.densityR = \frac{|\{u \in v.Adj : |v.Adj| \ge |u.Adj|\}|}{|v.Adj|},

v.compactnessR = \frac{|\{u \in v.Adj : AIS(G^{*}_{v}) \ge AIS(G^{*}_{u})\}|}{|v.Adj|},

where v.Adj and u.Adj are the sets of adjacent vertices of v and u, respectively; G*_v and G*_u are the ws-graphs determined by vertices v and u, and AIS(G*_v) and AIS(G*_u) are the approximated intra-cluster similarities of G*_v and G*_u, respectively. The approximated intra-cluster similarity of a ws-graph G* is defined as the average weight of the edges existing in G* between its center and its satellites.

Based on the above definitions, the strategy that OClustR uses in order to build the list C is composed of three steps. First, a candidate list L containing the vertices having relevance greater than zero is created; isolated vertices are directly included in C. Then, L is sorted in decreasing order of relevance and each vertex v ∈ L is visited. If v is not covered yet or it has at least one adjacent vertex that is not covered yet, then v is added to C. Each selected vertex, together with its adjacent vertices, constitutes a cluster in the initial set of clusters. The second stage of OClustR also has a computational complexity of O(n²). Figure 1 shows, through an example, the steps performed by OClustR in the second stage for building the initial set of clusters.

Finally, in the third stage, the final clusters are obtained through a process which aims to improve the initial clusters. With this aim, OClustR processes C in order to remove the vertices forming a non-useful ws-graph. A vertex v forms a non-useful ws-graph if: a) there is at least another vertex u ∈ C such that the ws-graph u determines includes v as a satellite, and b) the ws-graph determined by v shares more vertices with other existing ws-graphs than those it only contains. For removing non-useful vertices, OClustR uses three steps. First, the vertices in C are sorted in descending order according to their number of adjacent vertices. After that, each vertex v ∈ C is visited in order to remove those non-useful ws-graphs determined by vertices in (v.Adj ∩ C). If a ws-graph G*_u, with u ∈ (v.Adj ∩ C), is non-useful, u is removed from C and the satellites it only covers are "virtually linked" to v by adding them to a list named v.Linked; in this way, those vertices virtually linked to v will also belong to the ws-graph v determines. Once all vertices in (v.Adj ∩ C) are analyzed, v, together with the vertices in v.Adj and v.Linked, constitutes a final cluster. This third stage also has a computational complexity of O(n²). Figure 2 shows, through an example, how the final clusters are obtained from the initial clusters shown in Figure 1(d).
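As a concrete reading of the two definitions above, the following sequential C++ sketch computes the relevance of a single vertex from its adjacency lists and precomputed AIS values; the container and function names are chosen here for illustration and are not part of the original algorithm's code. Section 3 describes how CUDA-OClus parallelizes exactly this per-neighbor counting on the GPU.

    #include <vector>

    // Minimal sketch of v.relevance = (v.densityR + v.compactnessR) / 2.
    // 'adj' holds the adjacency list of every vertex and 'ais' holds AIS(G*_v).
    float relevance(int v,
                    const std::vector<std::vector<int>>& adj,
                    const std::vector<float>& ais) {
        if (adj[v].empty()) return 0.0f;          // isolated vertices are handled separately
        int denser = 0, moreCompact = 0;
        for (int u : adj[v]) {
            if (adj[v].size() >= adj[u].size()) ++denser;       // counts towards v.densityR
            if (ais[v] >= ais[u])               ++moreCompact;  // counts towards v.compactnessR
        }
        float densityR     = static_cast<float>(denser)      / adj[v].size();
        float compactnessR = static_cast<float>(moreCompact) / adj[v].size();
        return 0.5f * (densityR + compactnessR);
    }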
2.2 Updating the clusters after changes: the DClustR algorithm

Let G̃β = ⟨V, Ẽβ, S⟩ be the weighted thresholded similarity graph that represents an already clustered collection O. Let C = {c1, c2, ..., ck} be the set of vertices representing the current covering of G̃β and, consequently, the current clustering. When some vertices are added to and/or removed from O (i.e., from G̃β), the following two situations could happen:

1) Some vertices become uncovered. This situation occurs when at least one of the added vertices is uncovered or when those vertices of C covering a specific vertex were deleted from G̃β.

2) The relevance of some vertices changes and, as a consequence, at least one vertex u ∉ C appears such that u has relevance greater than at least one vertex in C that covers vertices in u.Adj ∪ {u}. Vertices like u could determine ws-graphs with more satellites and less overlapping with other ws-graphs than other ws-graphs currently belonging to the covering of G̃β.

Figure 2: Illustration of how the final clusters are obtained by OClustR in the third stage; panel (b) shows the final set of overlapping clusters.

Figure 3 illustrates the above commented situations over the graph G̃β of Figure 1(a). Figure 3(a) shows the graph G̃β before the changes; the vertices to be removed are marked with an "x". Figure 3(b) shows the graph G̃β after the changes; vertices filled with light gray represent the added vertices. Figures 3(c) and 3(d) show the updated graph G̃β with vertices labeled with letters and with their updated value of relevance, respectively; vertices filled with black correspond to those vertices currently belonging to C. As it can be seen from Figures 3(c) and 3(d), vertices S, F, G, I, H and J became uncovered after the changes, while vertex B, which does not belong to C, has a relevance greater than vertex D, which already belongs to C.

Taking into account the above mentioned situations, in order to update the clustering after changes DClustR first detects which are the connected components of G̃β that were affected by changes and then it iteratively updates the covering of these components and, consequently, their clustering.

The connected components that are affected by changes are those that contain vertices that were added or vertices that were adjacent to vertices that were deleted from G̃β. Since DClustR has control over these vertices, it can build these components through a depth-first search, starting from any of these vertices. Let G' = ⟨V', E', S⟩ be a connected component affected by changes, whose covering must be updated. Let C' ⊆ C be the set of vertices of G' which determine ws-graphs (i.e., clusters) covering G'. DClustR follows the same principles of OClustR; that is, it first builds or completes the covering of G' in order to build an initial set of clusters (stage 1) and then it improves these clusters in order to build the final set of clusters of G' (stage 2). In fact, DClustR uses the same steps as OClustR for the above two mentioned stages, but unlike OClustR, DClustR modifies the way in which the candidate list L, used in stage 1, is built.

In order to build the candidate list L, DClustR first recomputes the relevance value of all vertices in G' and it empties the list c.Linked, for all vertices c ∈ C'; this last action is supported by the fact that, after changes, there could be ws-graphs that were considered as non-useful which could no longer be so. Let V+ ⊆
(V' \ C') be the set of vertices of G' with relevance greater than zero, which do not belong to C'. For building the candidate list L, both C' and V+ are processed.

For processing V+, DClustR visits each vertex v ∈ V+ and it verifies a) if v is uncovered, or b) if at least one adjacent vertex of v is uncovered, or c) if there is at least one vertex u ∈ v.Adj such that there is no other vertex in C' covering u whose relevance is greater than or equal to the relevance of v. If any of these three conditions is fulfilled, v is added to L. Additionally, if the last condition is fulfilled, all those vertices like u are marked as "activated" in order to use them when C' is being processed. The computational complexity of the processing of V+ is O(n²).

For processing C', DClustR visits the adjacent vertices of each vertex v ∈ C'. Any vertex u ∈ v.Adj having greater relevance than v is added to L; in these cases, v is additionally marked as "weak". Once all the adjacent vertices of v have been visited, if v was marked as "weak" or at least one of its adjacent vertices was previously marked as "active", v is removed from C' since it could be substituted by a more relevant vertex. However, if v has a relevance greater than zero, it is still considered as a candidate and, consequently, it is added to L. The computational complexity of the processing of C' is O(n²).

Figure 4 shows the updated set of overlapping clusters obtained by DClustR when it processes the graph in Figure 3(d); vertices filled with black represent the vertices determining ws-graphs that cover each connected component of G̃β. Like OClustR, the computational complexity of DClustR is O(n²).

Figure 4: Updated set of overlapping clusters obtained by DClustR.

3 Proposed parallel algorithms

As it was mentioned in Section 1, despite the good achievements attained by OClustR and DClustR in the task of document clustering, these algorithms are O(n²), so they could be less useful in applications dealing with a very large number of documents. Motivated by this fact, in this section two massively parallel implementations in CUDA of OClustR and DClustR are proposed in order to enhance the efficiency of OClustR and DClustR in the above mentioned problems. These parallel algorithms, namely CUDA-OClus and CUDA-DClus, take advantage of the benefits of GPUs, like for instance the high bandwidth communication between CPU and GPU, and the GPU memory hierarchy.

Although in their original articles both OClustR and DClustR were proposed as general purpose clustering algorithms, the parallel extensions proposed in this work are specifically designed for processing documents. This application context is the same in which both OClustR and DClustR were evaluated and it is also a context in which very large collections are commonly processed. In the context of document processing, both CUDA-OClus and CUDA-DClus use the cosine measure [9] for computing the similarity between two documents; this measure is the function that has been used the most for this purpose [10]. The cosine measure between two documents di and dj is defined as:

cos(d_i, d_j) = \frac{\sum_{k=1}^{m} d_i(k) \cdot d_j(k)}{\|d_i\| \cdot \|d_j\|},   (1)

where di(k) and dj(k) are the weights of the k-th term in the description of the documents di and dj, respectively; ‖di‖ and ‖dj‖ are the norms of documents di and dj, respectively.
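For reference, Expression (1) can be computed directly over the sparse term/weight representation that CUDA-OClus uses for documents (described in Section 3.1). The following host-side C++ sketch assumes the term indexes of each document are stored in increasing order; the type and function names are illustrative only and do not come from the authors' code.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Sparse document: parallel arrays of term positions (sorted) and weights.
    struct SparseDoc {
        std::vector<int>   terms;    // positions of the terms in the global term list T
        std::vector<float> weights;  // weights of those terms
    };

    // Cosine similarity following Expression (1): only terms shared by both
    // documents contribute to the numerator.
    float cosine(const SparseDoc& a, const SparseDoc& b) {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (float w : a.weights) na += w * w;
        for (float w : b.weights) nb += w * w;
        std::size_t i = 0, j = 0;
        while (i < a.terms.size() && j < b.terms.size()) {
            if (a.terms[i] == b.terms[j])      dot += a.weights[i++] * b.weights[j++];
            else if (a.terms[i] < b.terms[j])  ++i;
            else                               ++j;
        }
        float denom = std::sqrt(na) * std::sqrt(nb);
        return denom > 0.0f ? dot / denom : 0.0f;
    }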
In experiments conducted over several document collections, it was verified that the first stage of OClustR, the construction of the similarity graph, consumes 99% of the processing time of the algorithm. The remaining 1% is mainly dominated by the computation of the relevance of the vertices. Based on this fact, the above two mentioned steps are the ones that are implemented in CUDA by CUDA-OClus; the remaining steps are high memory consuming tasks that are more favored with a CPU implementation. Analogously, in these experiments it was also verified that the most time consuming steps of DClustR are the updating of the graph after changes and the recomputing of the relevance, so these steps are implemented in CUDA by CUDA-DClus. In this case, it could be noticed also that the detection of the connected components affected by changes is a high memory consuming task performed by DClustR, so it is also important to address this problem in CUDA-DClus.

Finally, it is also important to mention that since we are dealing with the problem of processing very large document collections, CUDA-DClus only tackles additions, which are the changes that could increase the size of the collection. Implementing deletions is irrelevant for overcoming problems related with large document collections.

In the following, the CUDA-OClus algorithm is first introduced and then the CUDA-DClus algorithm is presented.

3.1 CUDA-OClus algorithm

Let D = {d1, d2, ..., dn} be a collection of documents described by a set of terms. Let T = {t1, t2, ..., tm} be the list containing all the different terms that describe at least one document in D. CUDA-OClus represents a document di ∈ D by two parallel vectors, denoted by Tdi and Wdi. The first one contains the positions that the terms describing di have in T, and the second one contains the weights that those terms have in the description of di.

For building G̃β = ⟨V, Ẽβ, S⟩, OClustR demands S to be a symmetric similarity measure, so the similarity between any two documents (i.e., vertices in G̃β) needs to be computed only once. Based on this fact, and considering the inherent order the documents have inside a collection D (i.e., vertices in V), for building the edges relative to a vertex v ∈ V it is only necessary to compute the similarity between v and each vertex following v in V. Let Sucv be the list of vertices that follow a vertex v in V. To speed up the construction of G̃β, for each vertex v ∈ V, CUDA-OClus will compute in parallel the similarity between v and the vertices in Sucv.

Considering the definition of the cosine measure, it can be seen from Expression (1) that its numerator is a sum of independent products which could be computed all at once. On the other hand, taking into account that the norm of a document can be computed while the document is being read, the denominator of Expression (1) can also be resolved with no extra time. Based on these facts, CUDA-OClus also parallelizes the computation of the similarity between a pair of vertices, in order to speed up even more the construction of G̃β.

In order to carry out the previous idea, CUDA-OClus builds a grid comprised of k square blocks, each block having a shared memory square matrix (SMM), where k = n/t + 1 and t is the dimension of both the blocks and the matrices. A grid is a logic representation of a matrix of threads in the GPU.
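A minimal sketch of the launch configuration implied by this description is shown below: t is set to the maximum shared-memory tile dimension supported by the device, and the grid is made of k = n/t + 1 blocks of t × t threads. The value of t, the helper name and the commented kernel name are assumptions made here for illustration, not the authors' actual code.

    #include <cuda_runtime.h>

    // Hypothetical helper that derives the grid/block configuration described in
    // the text: k = n/t + 1 blocks of t x t threads, one t x t shared-memory
    // matrix (SMM) per block.
    void launchSimilarity(int n /* |Sucv| */) {
        const int t = 16;          // tile dimension; in the paper, the maximum the GPU allows
        const int k = n / t + 1;   // number of blocks
        dim3 block(t, t);          // each column of a block handles one vertex of Sucv
        dim3 grid(k);
        (void)grid; (void)block;   // placeholder: a kernel like Algorithm 1 would be launched
                                   // here, e.g. cosineKernel<<<grid, block>>>(...);
    }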
The use of the SMM and its low latency will allow CUDA-OClus to not constantly access the CPU memory, speeding up the calculation of the similarity between two vertices. CUDA-OClus assigns to t the maximum value allowed by the architecture of the GPU for the dimension of a SMM.

When CUDA-OClus is going to compute the similarity between a vertex v and the vertices in Sucv, it first builds a vector Pv of size m. This vector has zero in all its entries excepting in those expressed by the positions stored in Tv; these last entries contain their respective weights stored in Wv. Once Pv has been built, the list Sucv is partitioned into k sublists. Each one of these sublists is assigned to a block constituting the grid and the SMM associated with that block is emptied; i.e., all its cells are set to zero. When a sublist Q = {v1, v2, ..., vp} is assigned to a block inside a grid, each vertex in Q is assigned to a column of the block. In this context, to assign a vertex vi to a column means that each row of the column points to a term describing vi; in this way, the j-th row points to the j-th term describing vi. Figure 5 shows an example of how the list Sucv is divided by CUDA-OClus into k sublists and how these sublists are assigned to the blocks constituting the grid. The example in Figure 5 shows how the first vertex of the sublist assigned to "block 0" is assigned to the first column of that block; the other assignments can be deduced from this example.

Figure 5: Illustration of how CUDA-OClus divides Sucv and assigns each resulting sublist to the blocks.

Each row inside a column of a block has a thread that performs a set of operations. In our case, the threads associated with the i-th column will compute the similarity between v and its assigned vertex vi. With this aim, the thread associated with each row inside the i-th column will compute the product between the weight that the term pointed by that row has in the description of vi, and the weight this same term has in the description of vertex v. It is important to note that although the sum in the numerator of Expression (1) runs over all the terms in T, the products that will be different from zero are only those between terms shared by both documents; this is the reason we only use the terms of vi and multiply their weights by the weights that these terms have in v; the remaining terms in v are useless. Given that the j-th row of the column to which vertex vi has been assigned points to the j-th term of Tvi, the weight this term has in vi is stored at the j-th position of Wvi, and the weight this same term has in v is stored in Pv, at the entry referred by the value stored at the j-th position of Tvi. The result of the product between the above mentioned weights is added to the value the j-th row already has in the SMM.

If the description of a vertex vi assigned to a column of a block exceeds the length of the column (i.e., t), a tiling is applied at this block. Tiling [11] is a technique that consists in dividing a data set into a number of small subsets, such that each subset fits into the block, i.e., the SMM. Thus, when the rows of a column point at the next t terms, the products between the weights these terms have in the description of vi and v are computed and accumulated into the values these rows have in the SMM. This technique is applied until all the terms describing the vertices assigned to the columns have been processed. Figure 6 shows how the similarity between the vertex v1 assigned to the first column of "Block 0" and v is computed.
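The dense vector Pv described above can be obtained with a simple scatter of v's sparse description. The host-side C++ sketch below shows this step under the Tv/Wv representation used by CUDA-OClus; the function name is chosen here for illustration.

    #include <cstddef>
    #include <vector>

    // Build the dense vector Pv of size m (the size of the global term list T):
    // Pv is zero everywhere except at the positions listed in Tv, which receive
    // the corresponding weights stored in Wv.
    std::vector<float> buildPv(const std::vector<int>& Tv,
                               const std::vector<float>& Wv,
                               int m) {
        std::vector<float> Pv(m, 0.0f);
        for (std::size_t j = 0; j < Tv.size(); ++j) {
            Pv[Tv[j]] = Wv[j];   // scatter the j-th weight of v to its term position
        }
        return Pv;
    }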
In this example, it has been assumed that there are 15 terms describing the documents of the collection, the size of the block is 5, and Tv = {1, 2, 5, 8, 10, 12, 14}, Wv = {0.2, 0.6, 0.3, 0.7, 0.2, 0.1, 0.5}, Tv1 = {2, 3, 5, 8, 10, 11, 12, 14} and Wv1 = {0.5, 0.3, 0.4, 0.8, 0.2, 0.6, 0.3, 0.3}. As it can be seen from Figure 6(a), each thread of the t rows of the first column computes the product between the weight of the term it points at and the weight this same term has in Pv (i.e., the description of v). As it was mentioned before, the computed products are stored in the SMM of that block. Note from Figure 6(a) that the product computed by the second row is zero, since vertex v does not contain the term pointed out by this row, i.e., the term having index 3 in T. Figure 6(b) shows how, when tiling is applied, the remaining terms describing v1 are pointed at by the rows of the first column. Figure 6(c) shows how the products between the remaining terms of v1 and v are performed. Finally, Figure 7 shows which are the values stored in the first column of the SMM of "Block 0", once all the products have been computed.

Figure 6: Illustration of how CUDA-OClus computes the similarity between a vertex v and the vertices in Sucv; panel (c) shows the computation of the remaining products.

Once all the terms describing the vertices assigned to a block have been processed, a reduction is applied over each column of the block. Reduction [12] is an operation that computes in parallel the sum of all the values of a column of the SMM and then stores this sum in the first row of the column. Figure 8 shows the final sum obtained for the first column of "Block 0".

Figure 7: Final results stored in the SMM after processing all terms of v1.

Figure 8: Result of applying Reduction on the first column of "Block 0".

The sum obtained on the column to which vertex vi has been assigned corresponds with the numerator of the cosine measure between v and vi. This sum is then divided by the product of the norms of v and vi, which have been previously computed; the result of this division (i.e., the similarity between v and vi) is copied to the CPU. Using this result, CUDA-OClus decides whether or not it should create an edge between v and vi; during this step CUDA-OClus also updates the values of AIS(v) and AIS(vi). The pseudocode of the cosine similarity function is shown in Algorithm 1.

Once the thresholded similarity graph G̃β has been built, CUDA-OClus speeds up the computation of the other time-consuming step: the computation of the relevance of the vertices. In order to do that, CUDA-OClus computes in parallel the relevance of all the vertices of G̃β. Moreover, for each vertex v, CUDA-OClus computes in parallel the contribution each adjacent vertex of v has over the relevance of v, speeding up even more the computation of the relevance of v. In order to accomplish this idea, the list of vertices of G̃β is partitioned into k sublists and each sublist is assigned to a block inside a grid. However, in this case, when a vertex vi of a sublist is assigned to a column of a block, each row in that column will point to an adjacent vertex of vi; e.g., the j-th row points at the j-th adjacent vertex of vi.

Algorithm 1: CUDA implementation of the cosine similarity function.
Input: Sucv, the list of vertices that follow a vertex v; Pv, the dense vector with the weights of v; for each vertex u ∈ Sucv, the term positions Tu and weights Wu describing u; Normv and Normu, the norms of v and u.
Output: similarity, the cosine similarity values between v and the vertices in Sucv.

    __shared__ float SMM[R][C];   // R = C because blocks are square
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid < |Sucv|) then
        u = Sucv[tid];
        int tidy = threadIdx.y;
        float sum = 0;
        while (tidy < |Wu|) do
            // accumulate the products between the weights of u and v
            sum += Wu[tidy] * Pv[Tu[tidy]];
            tidy += R;            // applying the tiling technique
        SMM[threadIdx.y][threadIdx.x] = sum;
    // wait until all threads have computed their products between v and u ∈ Sucv
    __syncthreads();
    // apply the reduction technique to compute the numerator of Expression (1)
    int i = R / 2;
    while (i != 0) do
        if (threadIdx.y < i) then
            SMM[threadIdx.y][threadIdx.x] += SMM[threadIdx.y + i][threadIdx.x];
        __syncthreads();
        i = i / 2;
    if (threadIdx.y == 0 && tid < |Sucv|) then
        similarity[tid] = 0;
        if (Normv > 0 && Normu > 0) then
            // divide by the product of the norms of v and u
            similarity[tid] = SMM[0][threadIdx.x] / (Normv * Normu);

Different from building the graph G̃β, now the threads associated with a column will compute the relevance of its assigned vertex. With this aim, the thread on each row of that column will compute the contribution the vertex pointed by that row has over the relevance of the vertex assigned to the column.

Let v be a vertex assigned to a column and u one of its adjacent vertices. Vertex u contributes 1/|v.Adj| to the relevance of v if |v.Adj| ≥ |u.Adj|; otherwise, its contribution is zero. This case represents the contribution u has to the relevance of v through the relative density of v. On the other hand, u contributes 1/|v.Adj| to the relevance of v if AIS(v) ≥ AIS(u); otherwise, its contribution is zero. This other case represents the contribution u has to the relevance of v through the relative compactness of v. The total contribution provided by a vertex is added to the value the row already has in the SMM; similar to the case of building the graph G̃β, the SMM of each block is initially emptied. If v has more than t adjacent vertices, tiling is applied. Once all the adjacent vertices of v have been processed, a reduction is applied in order to compute the relevance of v. The obtained values are then copied to the CPU. The pseudocode of the relevance function is shown in Algorithm 2.

As it was mentioned before, the remaining steps of OClustR were not implemented in CUDA because they are more favored with a CPU implementation, since they are high memory consuming tasks.

Algorithm 2: CUDA implementation of the relevance function.
Input: G̃β, the weighted thresholded similarity graph; AIS(G̃β), the approximated intra-cluster similarities of the vertices of G̃β.
Output: relevance, the relevance values of the vertices.

    __shared__ float SMM[R][C];   // R = C because blocks are square
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid < |V|) then
        int tidy = threadIdx.y;
        float sum = 0;
        while (tidy < |Adj[tid]|) do
            u = Adj[tid][tidy];   // the tidy-th adjacent vertex of the vertex tid
            // checking if the density and compactness conditions are met
            if (|Adj[u]| ≤ |Adj[tid]|) then
                sum += 1;
            if (AIS[u] ≤ AIS[tid]) then
                sum += 1;
            tidy += R;            // applying the tiling technique
        SMM[threadIdx.y][threadIdx.x] = sum;
    // wait until all threads have checked the density and compactness conditions
    __syncthreads();
    // apply the reduction technique to compute the relevance
    int i = R / 2;
    while (i != 0) do
        if (threadIdx.y < i) then
            SMM[threadIdx.y][threadIdx.x] += SMM[threadIdx.y + i][threadIdx.x];
        __syncthreads();
        i = i / 2;
    if (threadIdx.y == 0 && tid < |V|) then
        relevance[tid] = 0;
        if (|Adj[tid]| > 0) then
            // divide by twice the number of adjacent vertices of the current vertex
            relevance[tid] = SMM[0][threadIdx.x] / (2 * |Adj[tid]|);

3.2 CUDA-DClus algorithm

In order to update an already clustered collection when changes take effect, in our case additions, DClustR first detects, in the graph G̃β representing the collection, which are the connected components that were affected by changes and then it updates the cover of those components and, consequently, the overall clustering of the collection.

As it was stated in Section 2.2, the connected components affected by additions are those containing at least one added vertex. Thus, each time vertices are added to G̃β, in addition to computing the similarity between these vertices and those already belonging to G̃β in order to create the respective edges, DClustR also needs to build from scratch each affected connected component in order to update their covers. In order to reduce the amount of information DClustR needs to store in memory, CUDA-DClus proposes to represent the graph G̃β using an array of partial connected components, named ArrPCC, and two parallel arrays. The first of these parallel arrays, named V, contains the vertices in the order in which they were added to G̃β. The second array, named PCV, contains the index of the partial connected component to which each vertex belongs. This new representation allows CUDA-DClus to not need to rebuild the affected components each time the collection changes, but to keep the affected components updated each time vertices are added to the graph G̃β, with no extra cost.

Let G̃β = ⟨V, Ẽβ, S⟩ be the thresholded similarity graph representing the collection of documents. A partial connected component (PCC) in G̃β is a connected subgraph induced by a subset of vertices of G̃β. A partial connected component is represented using two arrays: one array containing the indexes the vertices belonging to that component have in G̃β, and the other array containing the adjacency lists of the aforementioned vertices.

The array of partial connected components representing G̃β is built once, while G̃β is being constructed. The strategy used by CUDA-DClus for this purpose is as follows. In the first step, CUDA-DClus adds a vertex in V for each document of the collection and then PCV is emptied (i.e., it is filled with -1), meaning that the vertices do not belong to any PCC yet.
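A compact way to read the representation just described, before the construction steps that follow, is as the C++ data layout sketched below; the field names mirror the arrays named in the text (ArrPCC, V, PCV), while the concrete types are assumptions made here for illustration.

    #include <vector>

    // A partial connected component (PCC): the indexes its vertices have in the
    // graph, plus the adjacency list of each of those vertices.
    struct PCC {
        std::vector<int>              vertexIdx;  // indexes of the vertices in V
        std::vector<std::vector<int>> adj;        // adjacency list of each vertex
    };

    struct GraphRepresentation {
        std::vector<PCC> ArrPCC;  // array of partial connected components
        std::vector<int> V;       // vertices, in the order they were added (document ids)
        std::vector<int> PCV;     // PCV[i] = index in ArrPCC of the PCC that vertex i
                                  // belongs to, or -1 if it belongs to no PCC yet
    };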
In the second step, CUDA-DClus processes each vertex vi ∈ V. If vi does not belong yet to a PCC, CUDA-DClus creates a new PCC and puts vi in this component; when a vertex v is added to a PCC, the index this PCC has in the array ArrPCC is stored in the array PCV, at the entry referred to by the index v has in array V; this is the way CUDA-DClus uses to indicate that now v belongs to a PCC. Following, CUDA-DClus computes the similarity between vi and the vertices in Sucvi, using the strategy proposed by CUDA-OClus. Once these similarities have been computed, CUDA-DClus visits each vertex u ∈ Sucvi. If S(vi, u) ≥ β and u does not belong to any PCC yet, then u is inserted in the PCC to which vi belongs and the adjacency lists of both vertices u and vi are modified in order to indicate they are similar to each other; otherwise, if u already belongs to a PCC, only the adjacency lists of both vertices are modified. In this last case, if the partial connected components to which vi and u belong are not the same, we will say that these partial connected components are linked.

As an example, let G̃β be initially empty and D = {d1, d2, ..., d9} be the set of documents that will be added to the collection. For the sake of simplicity, we will assume that CUDA-DClus already added the documents in G̃β as vertices and that the similarities existing between each pair of documents are those shown in Table 1. Taking into account the above mentioned information, Figures 9, 10, 11 and 12 exemplify how CUDA-DClus builds the array of partial connected components representing graph G̃β, for β = 0.3.

Table 1: Similarities existing between each pair of vertices of the example.
Vert./Vert.  v1    v2    v3    v4    v5    v6    v7    v8    v9
v1           -     0     0.4   0     0.5   0     0     0     0
v2           0     -     0     0.4   0     0     0.5   0     0
v3           0.4   0     -     0.7   0.6   0.3   0     0     0
v4           0     0     0.7   -     0     0     0     0     0
v5           0.5   0     0.6   0     -     0     0     0     0
v6           0     0     0.3   0     0     -     0     0     0
v7           0     0     0     0     0     0     -     0     0
v8           0     0     0     0     0     0     0     -     0.5
v9           0     0     0     0     0     0     0     0.5   -

Figure 9: Processing vertex v1.

As it can be seen from Figure 9, firstly CUDA-DClus processes vertex v1 in order to build the first PCC. As the result of the above process, vertices v3 and v5 are added to the first PCC, which is now constituted by vertices v1, v3 and v5. The second PCC is built when vertex v2 is processed, see Figure 10; this component is finally constituted by vertices v2, v4 and v7. Afterwards, as it can be seen from Figure 11, when vertex v3 is being processed, CUDA-DClus updates the first PCC by adding vertex v6 and updating the adjacency lists of vertices v3 and v5; CUDA-DClus also updates the second PCC by modifying the adjacency list of vertex v4, which is similar to vertex v3. In this example, these two partial connected components were joined by a dash line in order to illustrate the fact that they are linked, since vertices v3, belonging to the first PCC, and v4, belonging to the second PCC, are similar. Finally, the third PCC is created when CUDA-DClus processes vertex v8, as it can be seen in Figure 12. The processing of vertices v4, v5, v6, v7 and v9 does not affect the partial connected components built so far; therefore, it was not included in the example.

Figure 12: Processing vertex v8.

We would like to emphasize two facts about the above commented process. The first fact is that, since this is the first time the array ArrPCC representing G̃β is built, all these components are already in system memory. The second fact is that if we put a PCC Pi ∈ ArrPCC into a set QPi and then, iteratively, we add to QPi all the linked PCCs of each PCC belonging to QPi, the resulting set is a connected component.
Proof is straightforward by construction. Hereinafter, we will say that QPi is the connected component induced by PCC Pi.

Once the array ArrPCC representing G̃β was built, CUDA-DClus processes each of its partial connected components in order to build the clustering. For processing a PCC Pi ∈ ArrPCC that has not been processed in a previous iteration, CUDA-DClus first builds QPi and then CUDA-DClus recomputes the relevance of the vertices belonging to this component using the strategy proposed by CUDA-OClus. Once the relevance of the vertices has been recomputed, CUDA-DClus follows the same steps used by DClustR for updating the covering and, consequently, the clustering of QPi. The remaining steps of DClustR were not implemented in CUDA because they are more favored with a CPU implementation. Once the clustering has been updated, CUDA-DClus stores the existing partial connected components in the hard drive, releasing in this way the system memory.

Once G̃β changes due to the additions of documents to the collection, CUDA-DClus updates the array ArrPCC representing G̃β and then it updates the current clustering. In order to update the array ArrPCC, CUDA-DClus adds, for each incoming document, a vertex in G̃β and then CUDA-DClus sets to -1 the entries that these vertices occupy in PCV, in order to express that they do not belong to any PCC yet. Let M = {v1, v2, ..., vk} be the set of added vertices. Afterwards, for processing a vertex vi ∈ M, CUDA-DClus slightly modifies the strategy it uses for creating the partial connected components. Now, rather than computing the similarity of vi only with the vertices that come after vi in V (i.e., Sucvi), CUDA-DClus also computes the similarity of vi with respect to the vertices that belonged to G̃β before the changes; that is, the similarities are now computed between vi and each vertex in Sucvi ∪ (V \ M). The remaining steps are the same.

Let D1 = {d10, d11, ..., d15} be the set of documents that were added to the collection represented by graph G̃β, whose array of partial connected components was built in Figure 9, and let v10, v11, ..., v15 be the vertices that were consequently added to G̃β by CUDA-DClus. For the sake of simplicity, in the example it is assumed that none of the vertices belonging to G̃β before the changes is similar to the added vertices, with the only exception of v2, whose similarity with v10 is 0.5. Table 2 shows the similarities between each pair of the added vertices. Figures 13, 14, 15 and 16 show, assuming β = 0.3, how CUDA-DClus updates the array of partial connected components representing G̃β after the above mentioned additions. In these figures, vertices filled with light gray are those that were added to the collection.

Table 2: Similarities existing between each pair of added vertices.
       v10   v11   v12   v13   v14   v15
v10    -     0.4   0.3   0.6   0     0
v11    0.4   -     0     0.4   0     0
v12    0.3   0     -     0.4   0     0
v13    0.6   0.4   0.4   -     0     0
v14    0     0     0     0     -     0.5
v15    0     0     0     0     0.5   -

As it can be seen in Figure 13, firstly, CUDA-DClus processes vertex v10 and, as a result of this processing, another PCC is created for containing vertices v10, v11, v12 and v13. This new PCC was joined with the PCC determined by vertex v2, through a dash line, in order to reflect the fact that they are linked, since vertices v2 and v10 are similar. Furthermore, as it can be seen in Figures 14 and 15, this fourth PCC is updated when vertices v11 and v12 are processed, in order to reflect the fact that they are similar to vertex v13.
Finally, a fifth PCC is created when vertex v14 is processed, see Figure 16; this PCC contains vertices v14 and v15.

Once the array ArrPCC has been updated, CUDA-DClus processes each new PCC following the same strategy commented above, in order to update the current clustering. It is important to highlight that, different from when ArrPCC was created, this time the partial connected components loaded into the system memory are those belonging to the connected components determined by each newly created PCC; the other partial connected components remain in the hard drive. Although in the worst scenario an incoming document can be similar to all existing documents in the collection, generally similarity graphs are very sparse, so it is expected that the new representation proposed by CUDA-DClus, as well as the strategy it uses for updating the array of partial connected components, help CUDA-DClus to save system memory.

4 Experimental results

In this section, the results of several experiments done in order to show the performance of the CUDA-OClus and CUDA-DClus algorithms are presented. The experiments were conducted over eight document collections and were focused on: (1) assessing the correctness of the proposed parallel algorithms wrt. their original non-parallel versions, (2) evaluating the improvement achieved by the proposed algorithms with respect to the original OClustR and DClustR algorithms, and (3) evaluating the memory both CUDA-DClus and DClustR consume when they are processing the same collection. All the algorithms were implemented in C++; the codes of the OClustR and DClustR algorithms were obtained from their authors. For implementing CUDA-OClus and CUDA-DClus the CUDA Toolkit 5.5 was used. All the experiments were performed on a PC with a Core i7-4770 processor at 3.40 GHz, 8 GB RAM, having a PCI Express NVIDIA GeForce GT 635, with 1 GB DRAM.

The document collections used in our experiments were built from two benchmark text collections commonly used in documents clustering: Reuters-v2 and TDT2. Reuters-v2 can be obtained from http://kdd.ics.uci.edu, while the TDT2 benchmark can be obtained from http://www.nist.gov/speech/tests/tdt.html. From these benchmarks, eight document collections were built. The characteristics of these collections are shown in Table 3. As it can be seen from Table 3, these collections are heterogeneous in terms of their size, dimension and the average size of the documents they contain.

Table 3: Overview of the collections used in our experiments.
Coll.     #Docs.   #Terms   Terms/Docs.
Reu-10K   10000    33370    27
Reu-20K   20000    48493    41
Reu-30K   30000    59413    50
Reu-40K   40000    70348    58
Reu-50K   50000    74720    64
Reu-60K   60000    81632    69
Reu-70K   70000    91490    76
Tdt-65K   65945    114828   210

In our experiments, documents were represented using the Vector Space Model (VSM) [13]. The index terms of the documents represent the lemmas of the words occurring at least once in the collection; these lemmas were extracted from the documents using Tree-tagger¹. Stopwords such as articles, prepositions and adverbs were removed. The index terms of each document were statistically weighted using their term frequency. Finally, the cosine measure was used to compute the similarity between two documents [9].
4.1 Correctness evaluation

As it was mentioned before, the first experiment was focused on assessing the correctness of the proposed algorithms. With this aim, we compare the clusterings built by CUDA-OClus and CUDA-DClus with respect to those built by the original OClustR and DClustR algorithms, under the same conditions. For evaluating CUDA-OClus we selected the Reu-10K, Reu-20K, Reu-30K, Reu-40K and Reu-50K collections, whilst for evaluating CUDA-DClus we selected the Reu-10K, Reu-20K and Reu-30K collections. These collections were selected because they resemble the collections over which OClustR and DClustR were evaluated in [6] and [7], respectively.

In order to evaluate CUDA-OClus, we executed OClustR and CUDA-OClus over the Reu-10K, Reu-20K, Reu-30K, Reu-40K and Reu-50K collections, using β = 0.25 and 0.35. We used these threshold values as they obtained the best results in several collections, as reported in the original OClustR [6] and DClustR [7] articles. Then, we took the clustering results obtained by OClustR as ground truth and we evaluated the clustering results obtained by CUDA-OClus in terms of their accuracy, using the FBcubed [14] and the Normalized Mutual Information (NMI) [15] external evaluation measures.

¹ Tree-tagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger

FBcubed is one of the external evaluation measures most used for evaluating overlapping clustering algorithms and, unlike other external evaluation metrics, it meets the four fundamental constraints proposed in [14] (cluster homogeneity, cluster completeness, rag bag, and cluster size vs quantity). On the other hand, NMI is a measure of similarity borrowed from information theory, which has proved to be reliable [15]. Both metrics take values in [0, 1], where 1 means identical results and 0 completely different results. In order to take into account the inherent data order dependency of CUDA-OClus, we executed CUDA-OClus twenty more times over the above mentioned collections, for each parameter value, varying the order of their documents. Table 4 shows the average FBcubed and NMI values attained by CUDA-OClus for each selected collection, using β = 0.25 and 0.35.

Table 4: Average FBcubed and NMI values attained by CUDA-OClus for each selected collection.
FBcubed
Threshold   Reu-10K   Reu-20K   Reu-30K   Reu-40K   Reu-50K
β = 0.25    0.999     0.999     1.000     0.998     1.000
β = 0.35    0.999     1.000     1.000     1.000     0.999
NMI
Threshold   Reu-10K   Reu-20K   Reu-30K   Reu-40K   Reu-50K
β = 0.25    0.997     0.999     1.000     0.999     1.000
β = 0.35    0.998     1.000     1.000     1.000     0.999

As it can be seen from Table 4, the average FBcubed and NMI values attained by CUDA-OClus are very close to 1, meaning that the clusters CUDA-OClus builds are almost identical to those built by OClustR. The differences between the clusterings are caused by the inherent data order dependency of the algorithms and also by the different floating point arithmetic used by CUDA.

In order to assess the validity of CUDA-DClus, in the second part of the first experiment, we compare the clustering results built by CUDA-DClus with respect to those obtained by DClustR. With this aim, we obtain a ground truth by executing DClustR over the Reu-30K collection, also using β = 0.25 and β = 0.35, and then we process the Reu-20K and Reu-10K collections, in this order, as if they were additions of documents to the collection.
That is, we are going to add the documents contained in Reu-20K to the current collection (i.e., Reu-30K) and update the clustering using DClustR and, after that, we are going to add Reu-10K to the collection resulting from the previous additions (i.e., Reu-30K union Reu-20K) and update the clustering again. We repeated the above mentioned execution under the same parameter configuration but using CUDA-DClus instead of DClustR. Then, we take the results obtained by DClustR as ground truth and we evaluate each of the three clustering results obtained by CUDA-DClus in terms of their accuracy, using the FBcubed and NMI external evaluation measures. Like in the first part of this experiment, we executed CUDA-DClus twenty times under the above mentioned experimental configuration, each time varying the order of the documents inside the collections. Table 5 shows the average FBcubed and NMI values attained by CUDA-DClus for each selected collection, using β = 0.25 and 0.35.

Table 5: Average FBcubed and NMI values attained by CUDA-DClus for each selected collection.
FBcubed
Threshold   Reu-30K   Reu-30K+Reu-20K   Reu-30K+Reu-20K+Reu-10K
β = 0.25    0.999     0.995             0.998
β = 0.35    0.995     0.996             0.991
NMI
Threshold   Reu-30K   Reu-30K+Reu-20K   Reu-30K+Reu-20K+Reu-10K
β = 0.25    0.998     0.994             0.999
β = 0.35    0.997     0.998             0.995

As it can be seen from Table 5, the average FBcubed and NMI values attained by CUDA-DClus are very close to 1, meaning that the clusters it builds are almost identical to those built by DClustR. From this first experiment, we can conclude that the speed-up attained by CUDA-OClus and CUDA-DClus does not degrade their accuracy wrt. the original non-parallel versions.

4.2 Execution time evaluation

In the second experiment, we evaluate the time improvement achieved by CUDA-OClus and CUDA-DClus with respect to OClustR and DClustR, respectively. With this aim, we executed both OClustR and CUDA-OClus over Reu-10K, Reu-20K, Reu-30K, Reu-40K, Reu-50K, Reu-60K and Reu-70K, using β = 0.25 and 0.35, and we measured the time they spent. Like in the previous experiment, in order to take into account the data order dependency of both algorithms, we repeated the above mentioned executions twenty times, for each collection and each parameter configuration, but varying the order of the documents of the collections. Figure 17 shows the average time both OClustR and CUDA-OClus spent for clustering each selected collection, for each parameter configuration.

As it can be seen from Figure 17, CUDA-OClus is faster than OClustR over each selected dataset and for both values of β; for β = 0.25 and β = 0.35, CUDA-OClus is respectively 1.26x and 1.29x faster than OClustR. It is important to note from Figure 17 that, as the size of the processed collection grows, the difference in the time spent by each algorithm also grows; this behavior shows how well CUDA-OClus scales when the size of the collection grows. We would like to highlight the fact that the specifications of the computer used in the experiments provided an advantage to CPU-based algorithms over GPU-based algorithms, since a Core i7-4770 processor at 3.40 GHz with 8 GB RAM is superior to a PCI Express NVIDIA GeForce GT 635 with 1 GB DRAM, which only has two streaming multiprocessors and a limited memory.
Hence, taking into account the execution model of a GPU, in which the grid blocks are enumerated and distributed among all the streaming multiprocessors, which simultaneously execute one task over a specific block, we expect that if we use a more powerful GPU with more streaming multiprocessors, the difference between the processing times achieved by the parallel version and the sequential version will be higher than the one showed in these experiments.

Figure 17: Time spent by OClustR and CUDA-OClus for clustering the selected experimental datasets, using β = 0.25 and 0.35; panel (b) shows β = 0.35.

In order to compare both DClustR and CUDA-DClus, in the second part of the second experiment, we clustered the Reu-50K collection using both algorithms and then we measured the time each algorithm spent for updating the current clustering each time N documents of the Tdt-65K collection are incrementally added to the existing collection. In this experiment we also used β = 0.25 and 0.35, and we set N = 5000 and N = 10000, which are much greater values than those used to evaluate DClustR [7]. In order to take into account the data order dependency of both algorithms, the above mentioned executions were also repeated twenty times, for each collection and each parameter configuration, but varying the order of the documents of the collections. Figure 18 shows the average time both DClustR and CUDA-DClus spent for updating the current clustering, for each parameter configuration.

Figure 18: Time spent by DClustR and CUDA-DClus for updating the current clustering, using β = 0.25 and 0.35, for N = 5000 and 10000.

As it can be seen from Figure 18, CUDA-DClus has a better behavior than DClustR, for each parameter configuration, when multiple additions are processed over the selected dataset, showing an average speed-up of 1.25x and 1.29x for β = 0.25, N = 5000 and β = 0.35, N = 5000, respectively. Moreover, it also showed an average speed-up of 1.19x and 1.26x for β = 0.25, N = 10000 and β = 0.35, N = 10000, respectively. As in the first part of this experiment, it can also be seen from Figure 18 that the behavior of CUDA-DClus, with respect to that of DClustR, becomes better as the size of the collection grows; in this way, we can say that CUDA-DClus also scales well as the size of the collection grows.

4.3 Memory use evaluation

Although the spatial complexity of both algorithms is O(|V| + |Ẽβ|), the strategy CUDA-DClus proposes for representing G̃β should allow reducing the amount of memory needed to update the clustering each time the collection changes. Thus, in the third experiment, we compare the amount of memory used by CUDA-DClus against that used by DClustR, when processing the changes performed in the second experiment. The number of connected components loaded by both algorithms when they are updating the current clustering after changes is directly proportional to the memory used. Based on this, Figure 19 shows the average number of connected components (i.e., Ave. NCC) each algorithm loads into system memory, when processing the changes presented in Figure 18, for each parameter configuration.
We would like to highlight that in the worst scenario, if the clustering of all the connected components needs to be updated, all the partial connected components will be loaded into system memory and thus our proposed CUDA-DClus and DClustR will have a similar behavior. Additionally, taking into account the results of the experiments in Sections 4.1 and 4.2, we can conclude that the strategy proposed for reducing the memory used by CUDA-DClus does not add any considerable cost to the overall processing time of CUDA-DClus or to its accuracy.

5 Conclusions

In this paper, we introduced two GPU-based parallel versions of the OClustR and DClustR clustering algorithms, namely CUDA-OClus and CUDA-DClus, specifically tailored for document clustering. CUDA-OClus proposes a strategy to speed up the most time consuming steps of OClustR. This strategy is reused by CUDA-DClus in order to speed up the most time consuming steps of DClustR. Moreover, CUDA-DClus proposes a new strategy for representing the graph G̃β that DClustR uses for representing the collection of documents. This new representation allows CUDA-DClus to reduce the amount of memory it needs to use, and it also helps CUDA-DClus to avoid rebuilding the affected components each time the collection changes while still keeping them updated after each change, with no extra cost.

The proposed parallel algorithms were compared against their original versions over several standard document collections. The experiments were focused on: (a) assessing the correctness of the proposed parallel algorithms, (b) evaluating the speed-up achieved by CUDA-OClus and CUDA-DClus with respect to OClustR and DClustR, respectively, and (c) evaluating the memory both CUDA-DClus and DClustR consume when processing changes. From the experiments, it can be seen that both CUDA-OClus and CUDA-DClus are faster than OClustR and DClustR, respectively, and that the speed-up these parallel versions attain does not degrade their accuracy. The experiments also showed that CUDA-DClus consumes less memory than DClustR when both algorithms are processing changes over the same collection.

Based on the obtained results, we can conclude that both CUDA-OClus and CUDA-DClus enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents. These parallel algorithms could be useful in applications such as news analysis, information organization and profile identification, among others. We would like to mention that even though the proposed parallel algorithms were specifically tailored for processing documents with the cosine measure, the strategy they propose can be easily extended to work with other similarity or distance measures such as the Euclidean and Manhattan distances.

As future work, we are going to explore the use in CUDA-OClus and CUDA-DClus of other types of GPU memory, such as texture memory, which is faster than the memory both CUDA-OClus and CUDA-DClus are using now. Besides, we are going to evaluate both algorithms over faster GPU cards, in order to gain a better insight into the performance of both algorithms when the number of CUDA cores is increased.

References
[1] E. Bae, J. Bailey, G. Dong, A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings, Data Mining and Knowledge Discovery 21 (3) (2010) 427–471.
[2] A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM Computing Surveys (CSUR) 31 (3) (1999) 264–323.
[3] S. Gregory, A fast algorithm to find overlapping communities in networks, in: Machine Learning and Knowledge Discovery in Databases, Springer, 2008, pp. 408–423.
[4] J. Aslam, K. Pelekhov, D. Rus, Static and dynamic information organization with star clusters, in: Proceedings of the Seventh International Conference on Information and Knowledge Management, ACM, 1998, pp. 208–217.
[5] A. Pons-Porrata, R. Berlanga-Llavori, J. Ruiz-Shulcloper, J. M. Pérez-Martínez, Jerartop: A new topic detection system, in: Progress in Pattern Recognition, Image Analysis and Applications, Springer, 2004, pp. 446–453.
[6] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, OClustR: A new graph-based algorithm for overlapping clustering, Neurocomputing 121 (2013) 234–247.
[7] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, An algorithm based on density and compactness for dynamic overlapping clustering, Pattern Recognition 46 (11) (2013) 3040–3055.
[8] L. J. G. Soler, A. P. Suárez, L. Chang, Efficient overlapping document clustering using GPUs and multi-core systems, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 19th Iberoamerican Congress, CIARP 2014, Puerto Vallarta, Mexico, November 2-5, 2014. Proceedings, 2014, pp. 264–271.
[9] M. W. Berry, M. Castellanos, Survey of text mining, Computing Reviews 45 (9) (2004) 548.
[10] R. Gil-García, A. Pons-Porrata, Dynamic hierarchical algorithms for document clustering, Pattern Recognition Letters 31 (6) (2010) 469–477.
[11] J. Sanders, E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 2010.
[12] D. B. Kirk, W. H. Wen-mei, Programming Massively Parallel Processors: A Hands-on Approach, Newnes, 2012.
[13] G. Salton, A. Wong, C.-S. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (11) (1975) 613–620.
[14] E. Amigó, J. Gonzalo, J. Artiles, F. Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval 12 (4) (2009) 461–486.
[15] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure in complex networks, New Journal of Physics 11 (3) (2009) 033015.

Weighted Density Center Clustering Protocol for Wireless Sensor Networks

Amira Slimani and Mohammed Redjimi
Universite 20 Aout 1955, Faculte des sciences, Departement d'informatique, 21000 Skikda, Algeria
E-mail: amira201015@hotmail.fr, redjimimed@yahoo.fr

Djamel Slimani
LIS laboratory, Department of electronics communication, University of Setif 1, 19000 Setif, Algeria
E-mail: slimani.djamel@ymail.com

Keywords: WSNs, LEACH, clustering, hierarchical protocols

Received: December 29, 2016

Wireless sensor networks (WSNs) are often composed of a huge number of micro sensors. Low energy batteries and low processing capabilities characterize these tiny devices. The most critical challenge in these networks is energy conservation. A hierarchical architecture with clustering of nodes is suitable for solving many problems in WSNs and has many benefits, such as energy efficiency, data aggregation and scalability.
Low Energy Adaptive Clustering Hierarchy (LEACH) and its centralized version LEACH-C are the most popular hierarchical protocols. This paper proposes a cluster-based routing protocol for WSNs, which is an improvement of LEACH-C. It is based on the weighted density center and on energy criteria in the selection of the cluster heads (CHs). The proposed protocol provides a significant increase in network lifetime and more energy efficiency. The simulation results show that the proposed algorithm performs better than the LEACH-C, Improved-LEACH and Distance and Energy Aware LEACH (DE-LEACH) algorithms.

Povzetek: Prispevek predlaga protokol usmerjanja v brezžičnih senzorskih omrežjih (WSN) kot izboljšanje protokola LEACH-C.

1 Introduction

With the recent advancements in wireless communication and miniaturization technology, micro sensors have become possible and popular techniques in military, health, security, commercial and industrial applications [1-4]. These tiny sensors, dispersed in huge numbers and in an autonomous manner, have the capacity to form self-configured networks [5-6] in Wireless Sensor Network (WSN) applications. Useful data collected by the sensors from the environment are then sent to the base station. Therefore, the greatest challenge for these devices is energy conservation, since they operate with limited non-rechargeable batteries. In large scale WSNs, clustering is usually used to provide effectiveness in energy saving and topology stability [7-8]. In cluster-based protocols, nodes are grouped into clusters in which a leader node is elected as the cluster head (CH). Each CH receives the data transmitted by the other member nodes of the cluster and in turn transmits them in aggregated form to the base station. The CH must therefore be efficiently selected, as it is required to organize the activities in the cluster. These activities deal with data aggregation, monitoring and scheduling of the cluster communication in order to optimize the energy of the sensors and extend the network lifetime [3].

Several protocols have been proposed in the literature [9-20] to optimize the operation of sensors and raise the network efficiency. Among these research works, the hierarchical protocols based on dividing the network into clusters and electing cluster leaders according to metrics are the most considered. The Low Energy Adaptive Clustering Hierarchy (LEACH) is the first and most popular hierarchical protocol, and it operates in rounds. In each round, two principal operations are performed. The first, called the setup phase, concerns the creation of clusters and the election of their leaders. The second deals with the data transmission to the base station by the CHs [10]. The centralized Low Energy Adaptive Clustering Hierarchy (LEACH-C) is the centralized version of LEACH that also operates in rounds, each round being divided into two phases: setup and steady state. It provides energy conservation efficiency, particularly in the setup phase, where it takes into account the energy level of the sensors in the selection of CHs, which is not considered in LEACH [11]. After these two popular protocols, many other approaches have been proposed and adopted. Some of them are based on LEACH [12-15]. Mobile-LEACH (M-LEACH) [12] is a multi-hop version of LEACH, which uses multiple hops to send data to the base station. The Two Levels Low Energy Adaptive Clustering Hierarchy protocol (TL-LEACH) [13] uses two types of CHs, primary and secondary, to ensure more reliability in the CH.
The Improved-LEACH [14] routing communication protocol for wireless sensor networks proposes a vice cluster head for each cluster during the communication process. The Distance and Energy Aware LEACH (DE-LEACH) [15] is an improved LEACH routing protocol for WSNs in which the network is divided into two parts according to the distance of the sensors from the base station. On the other hand, several routing protocols based on fuzzy logic have been proposed for WSNs, such as the Cluster Head Election mechanism using Fuzzy logic (CHEF) [16]. The latter uses two fuzzy descriptors for CH election; the first is based on the residual energy and the second on the distance between nodes having the same radius r. This protocol operates in rounds. The Energy-aware Clustering Protocol using Fuzzy logic (ECPF) [17] is another protocol based on fuzzy logic, which proceeds in three operations: initialization, processing and finalization. Other techniques are also used for WSNs, such as bio-inspired and heuristic-based algorithms. Among them are Ant Colony Optimization (ACO), Genetic Algorithms (GA) and Particle Swarm Optimization (PSO). The ant colony optimization algorithms are among the most popular techniques of the last few years. The work in [18] proposed an ACO which uses a probabilistic approach to choose the best solutions. This approach is based on metrics, the most important one being the pheromone value, a variable that defines the quality of a path. The 'ants' in these protocols are placed in each transmitter node. Many other protocols have been proposed for WSNs with objectives in addition to energy, as in [19], where the main goal is network security. A scalable data-coupled clustering algorithm for large-scale WSNs is proposed in [20] in order to improve and achieve high scalability of the network.

To reduce the energy consumption and extend the lifetime of the sensor network, many research works have proposed multi-hop communication. However, in multi-hop algorithms, routing through intermediate nodes cannot be avoided, which in turn may increase the energy consumption. In order to overcome this problem, the approach proposed in this paper considers a transmission mode in which the message is directly transmitted from the cluster members to the cluster head (CH). The CH election is based on energy criteria and on the weighted density center (WDC) of the cluster nodes. The WDC idea significantly reduces the communication distances among nodes. Consequently, energy conservation is enhanced. This approach is inspired by the LEACH-C protocol. However, in the selection of the CH, the WDC algorithm chooses the node nearest to all nodes in the cluster, which is elected as cluster head. Then, the data transmission from the nodes of the cluster to the CH is carried out in a single hop and over a short distance. This mode of transmission avoids losses due to long node transmissions and reduces the number of paths and the energy consumption in the WSN.

The rest of this paper is organized as follows: Section 2 describes the network model. Section 3 deals with the energy model. Section 4 presents a description of the proposed algorithm. Sections 5 and 6 present respectively the simulation results and the conclusion.

2 The network model

The following properties are used to model the sensor region:
• The sensor nodes and the base station are assumed stationary once they are deployed in the environment.
• The base station is static, and its location is initially known.
• The wireless sensor network includes homogeneous sensor nodes.
• Initially, all sensor nodes have the same amount of energy.
• The base station is not limited in terms of energy, memory and computing power.
• The nodes are able to determine their current energy level and location information through a GPS service.
• All the sensor nodes are immobile and have a fixed node identification.
• Data aggregation is done at the CH node.

The following figure (Figure 1) shows the initial network vision.

3 The energy model

The energy consumption model used in the proposed protocol is presented in Figure 2. It is the same radio model as the one used in LEACH-C [10]. The energy dissipated by the transmitter for the transmission of k bits of data is given by equation (1):

E_TX(k, d) = E_elec·k + ε_fs·k·d²,  if d < d_0
E_TX(k, d) = E_elec·k + ε_mp·k·d⁴,  if d ≥ d_0     (1)

where E_TX(k, d) is the energy dissipated by the transmitter, E_elec is the energy consumed by the electronic devices, ε_fs and ε_mp are the power amplifier energies of the free-space (d²) and multipath (d⁴) models, d is the transmission distance, d_0 is the threshold distance that depends on the environment, and k is the number of transmitted bits. For the reception of k bits of data by a sensor node, the energy E_RX dissipated by the receiver is given by equation (2):

E_RX(k) = E_elec·k     (2)

4 The proposed WDC-LEACH-C clustering protocol

The proposed protocol, called Weighted Density Center Clustering Protocol based on LEACH-C (WDC-LEACH-C), operates in several rounds; each round is divided into two phases. The first is the setup phase, which consists of the cluster formation, the CH election and the scheduling of the transmissions of the nodes in each cluster. The schedule is created using a Time Division Multiple Access (TDMA) protocol. The CH election is based on several metrics: the initial energy, the residual node energy and a distance criterion. This last, important parameter is managed through the WDC of the nodes in the same cluster. Therefore, the message is directly transmitted from the cluster members to the cluster head (CH). The WDC criterion is not considered in LEACH-C, Improved-LEACH or DE-LEACH. When the setup phase is completed, the second phase (steady state) begins with data collection, aggregation and transmission to the base station. WDC-LEACH-C follows a clustering strategy in the setup phase based on the remaining energy of the sensors, the distances among nodes and single-hop communication. The proposed protocol minimizes the communication distance and the number of transmission routes and consequently reduces the energy consumption. Figure 3 shows a comparison between a multi-hop communication and a WDC single-hop communication in terms of paths.

The proposed protocol works as follows. Initially, after the deployment of the nodes in harsh environments, each sensor sends its information (location information and energy level) to the remote BS. Note that the periodicity of these initial packets is ignored in the simulation time. After receiving this information, a division of the network into equal regions is carried out to form clusters balanced in terms of number of nodes. Once the clusters are formed, a CH should be chosen in each cluster. The CH should be the node which holds the highest energy and is the closest to the cluster nodes. The selection of the CH is done in the following manner:

- Computing the average separation distance d_avg(i) from any node i to all nodes of the same cluster using equation (3):

d_avg(i) = (1/n) Σ_{j=1..n} d_ij     (3)

where d_avg(i) is the average distance, d_ij is the distance from node i to node j, and n is the total number of cluster nodes.

- Choosing the node placed at the center of gravity of the cluster. We call it the weighted density center (WDC) node, given by equation (4):

d_WDC = min_{i=1..n} { d_avg(i) }     (4)

- Selecting the nodes nearest to the WDC to constitute a group G; this can be done by performing the following test:

G = ∅;
for i = 1 … n: if d_avg(i) < d_threshold then add node i to G;

where d_threshold is approximated by the following formula, which selects the appropriate group G in each cluster:

d_threshold = ( d_WDC + (1/n) Σ_{i=1..n} d_avg(i) ) / 2     (5)

- Among the group G, selecting the most energized node to be the cluster head (CH). This is done in the following way:

1. Computation of the probability of the remaining energy for each node i in the group G:

P_i = E_residual(i) / Σ_{j∈G} E_residual(j)     (6)

2. Choosing the node of G that has the maximum energy. This can be done by equation (7):

P_max = max_{i∈G} { P_i }     (7)

The CH is chosen to be the node that attains this maximum energy.
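To make the selection procedure of equations (3)-(7) concrete, the following is a minimal Python sketch of one way these steps could be implemented for a single cluster; the function and variable names (select_cluster_head, positions, residual_energy) are illustrative rather than taken from the paper, and the sketch assumes Euclidean distances between known node coordinates.

import numpy as np

def select_cluster_head(positions, residual_energy):
    """Sketch of the WDC-based CH selection of equations (3)-(7).

    positions: (n, 2) array of node coordinates in one cluster.
    residual_energy: length-n array of remaining node energies.
    Returns the index of the node elected as cluster head.
    """
    n = len(positions)
    # Pairwise Euclidean distances d_ij between the cluster nodes.
    diff = positions[:, None, :] - positions[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))

    # Equation (3): average distance from node i to all nodes of the cluster.
    d_avg = d.sum(axis=1) / n

    # Equation (4): the WDC corresponds to the smallest average distance
    # (the node closest to the cluster's center of gravity).
    d_wdc = d_avg.min()

    # Equation (5): threshold used to select the group G of nodes near the WDC.
    d_threshold = (d_wdc + d_avg.mean()) / 2.0

    # Group G: nodes whose average distance is below the threshold.
    group = np.where(d_avg < d_threshold)[0]
    if len(group) == 0:                      # degenerate cluster, fall back
        group = np.array([int(np.argmin(d_avg))])

    # Equation (6): probability of the remaining energy inside G.
    p = residual_energy[group] / residual_energy[group].sum()

    # Equation (7): the CH is the member of G with the maximum energy share.
    return int(group[np.argmax(p)])

# Illustrative usage with random node positions and energies.
rng = np.random.default_rng(0)
pos = rng.uniform(0, 100, size=(10, 2))
energy = rng.uniform(0.5, 2.0, size=10)
ch = select_cluster_head(pos, energy)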
Figures 4 and 5 depict the WDC operations and the CH election, and Figure 6 shows the general flowchart of the clustering and CH election.

Figure 4: WDC calculation.
Figure 5: CH selection.
Figure 6: Flowchart of the cluster formation and CH election in WDC-LEACH-C.

5 Simulation and results

In this section, the simulation process is carried out on a WSN with 100 nodes, where the parameters of Table 1 are used.

Table 1: Simulation parameters.
Parameter                               Value
Simulation area                         100 m × 100 m
Network size                            100 sensors
BS location                             (50, 50)
Initial energy                          2 joules
Minimum energy                          0.001 joule
Maximum duration of the simulation      3600 seconds
Data packet size                        25 bytes

In order to validate the performance of the proposed WDC-LEACH-C algorithm, a comparison with the LEACH-C, Improved-LEACH (Impro-LEACH in the figures) and DE-LEACH algorithms has been done. In this comparison, five important metrics are considered:

1 - Network lifetime: the network lifetime can be defined in three ways. FND (First Node Died), also called the stability period, is the time interval between the start of the simulation and the death of the first node. HND (Half Nodes Died) is the time between the start of the simulation and the death of half of the nodes. LND (Last Node Died) is the time between the start of the simulation and the death of the last node.
2 - Number of alive nodes per round: this measures the number of alive nodes in each round.
3 - Consumed energy: this represents the energy consumption of the nodes over the simulation rounds. The main objective of routing protocols is to save the energy of the sensors, because when a sensor node runs out of energy it is dead and its life is over; energy is thus the key factor to be considered in WSNs.
4 - The quantity of data received by the BS, measured in bytes.
5 - The number of data packets received by the BS, where each packet corresponds to 25 bytes. The periodicity of these packets is ignored in the simulation time because our clustering algorithm does not include periodic transmissions to the base station as in the LEACH-C algorithm.

Figure 7 represents the network lifetime, in which FND, HND and LND are shown for LEACH-C, Improved-LEACH, DE-LEACH and the proposed protocol WDC-LEACH-C.
As seen from this figure, the death times of the first, half and last nodes in WDC-LEACH-C are later than those in LEACH-C, Improved-LEACH and DE-LEACH. Therefore, WDC-LEACH-C provides a longer network lifetime, due to the minimization of the number and length of the routes, which optimizes the energy consumption of the sensors.

Figure 8 shows the number of alive nodes per round in the network. As we can see from this figure, the number of alive nodes decreases more slowly in the proposed algorithm WDC-LEACH-C than in the LEACH-C, Improved-LEACH and DE-LEACH algorithms. When all nodes have run out of energy in LEACH-C, Improved-LEACH and DE-LEACH, the nodes in WDC-LEACH-C can still run for several additional rounds. Consequently, the network stability and lifetime of the proposed algorithm are much better with respect to the DE-LEACH, Impro-LEACH and LEACH-C algorithms.

Figure 8: Number of alive nodes.

Figure 9 illustrates the energy consumption of the nodes in the network. From this figure we observe that the proposed protocol WDC-LEACH-C consumes less energy than the LEACH-C, Improved-LEACH and DE-LEACH algorithms over the simulation rounds. This can be explained by the fact that the proposed algorithm provides better clustering and a good choice of CH. Since the CH is the WDC node, short transmission distances are used. Thus, the network performance is greatly improved.

Figure 9: Energy consumption of nodes over simulation rounds.
Figure 11: Number of packets received by the BS.

Table 2 summarizes the obtained results. As we can see from this table, WDC-LEACH-C outperforms the other algorithms in terms of energy consumption, prolonging the network lifetime, and the data and packets received by the BS.

Table 2: Simulation results.
Protocol       First node died   Last node died   Total energy            Total quantity of data    Total number of
               (round)           (round)          consumption (Joules)    sent to the BS (Bytes)    packets sent to BS
LEACH-C        382               569              376                     79845                     3193
Impro-LEACH    429               600              305                     101434                    4057
DE-LEACH       511               732              250                     147599                    5903
WDC-LEACH-C    581               801              217                     167132                    6680

6 Conclusion

In this paper, WDC-LEACH-C, a single-hop communication protocol for homogeneous WSNs, is proposed. The developed algorithm is based on the weighted density center (WDC) of the cluster nodes and on the residual energy in the selection of the cluster head (CH). The WDC-LEACH-C algorithm ensures better energy saving and a prolonged network lifetime compared to the LEACH-C, Improved-LEACH and DE-LEACH algorithms.

7 References

[1] Akyildiz, I. F., Su, W., Sankarasubramaniam, Y., & Cayirci, E. (2002). Wireless sensor networks: a survey. Computer Networks, 38(4), 393-422.
[2] Baronti, P., Pillai, P., Chook, V. W., Chessa, S., Gotta, A., & Hu, Y. F. (2007). Wireless sensor networks: A survey on the state of the art and the 802.15.4 and ZigBee standards. Computer Communications, 30(7), 1655-1695.
[3] Boyinbode, O., Le, H., & Takizawa, M. (2011). A survey on clustering algorithms for wireless sensor networks. International Journal of Space-Based and Situated Computing, 1(2-3), 130-136.
[4] Aseri, T. C. (2014, March). Comparison of routing protocols in wireless sensor network using mobile sink - A survey. In Engineering and Computational Sciences (RAECS), 2014 Recent Advances in (pp. 1-4). IEEE.
[5] Afsar, M. M., & Tayarani-N, M. H. (2014). Clustering in sensor networks: A literature survey. Journal of Network and Computer Applications, 46, 198-226.
[6] Jiang, C., Yuan, D., & Zhao, Y. (2009). Towards clustering algorithms in wireless sensor networks - a survey. In Wireless Communications and Networking Conference, WCNC 2009. IEEE, pp. 1-6. doi: 10.1109/WCNC.2009.4917996
[7] Kawadia, V., & Kumar, P. R. (2003, March). Power control and clustering in ad hoc networks. In INFOCOM 2003.
Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies (Vol. 1, pp. 459-469). IEEE.
[8] Karenos, K., Kalogeraki, V., & Krishnamurthy, S. V. (2008). Cluster-based congestion control for sensor networks. ACM Transactions on Sensor Networks (TOSN), 4(1), 5.
[9] Ahmad, H., & Kohli, N. (2015, November). A survey on various static and mobile base stations and wireless charging of sensor nodes in WSNs. In Communication Networks (ICCN), 2015 International Conference on (pp. 16-22). IEEE.
[10] Heinzelman, W. R., Chandrakasan, A., & Balakrishnan, H. (2000, January). Energy-efficient communication protocol for wireless microsensor networks. In System Sciences, 2000. Proceedings of the 33rd Annual Hawaii International Conference on (pp. 10-pp). IEEE.
[11] Heinzelman, W. B., Chandrakasan, A. P., & Balakrishnan, H. (2002). An application-specific protocol architecture for wireless microsensor networks. IEEE Transactions on Wireless Communications, 1(4), 660-670.
[12] Mhatre, V., & Rosenberg, C. (2004, June). Homogeneous vs heterogeneous clustered sensor networks: a comparative study. In Communications, 2004 IEEE International Conference on (Vol. 6, pp. 3646-3651). IEEE.
[13] Loscri, V., Morabito, G., & Marano, S. (2005, September). A two-level hierarchy for low-energy adaptive clustering hierarchy (TL-LEACH). In Vehicular Technology Conference, 2005. VTC-2005-Fall. 2005 IEEE 62nd (Vol. 3, pp. 1809-1813). IEEE.
[14] Zhao, F., Xu, Y., & Li, R. (2012). Improved LEACH routing communication protocol for a wireless sensor network. International Journal of Distributed Sensor Networks, 2012.
[15] Kumar, S., Prateek, M., Ahuja, N. J., & Bhuchan, B. (2014, February). DE-LEACH: Distance and Energy Aware LEACH. International Journal of Computer Applications, 88(9), 36-42.
[16] Runkler, T. A. (2008). Wasp swarm optimization of the c-means clustering model. International Journal of Intelligent Systems, 23(3), 269-285.
[17] Pinto, P., Runkler, T. A., & Sousa, J. M. (2005). Wasp swarm optimization of logistic systems. In Adaptive and Natural Computing Algorithms (pp. 264-267). Springer, Vienna.
[18] Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems (No. 1). Oxford University Press.
[19] Engelbrecht, A. P. (2006). Fundamentals of Computational Swarm Intelligence. John Wiley & Sons.
[20] Kim, J. M., Park, S. H., Han, Y. J., & Chung, T. M. (2008, February). CHEF: cluster head election mechanism using fuzzy logic in wireless sensor networks. In Advanced Communication Technology, 2008. ICACT 2008. 10th International Conference on (Vol. 1, pp. 654-659). IEEE.
[21] Taheri, H., Neamatollahi, P., Younis, O. M., Naghibzadeh, S., & Yaghmaee, M. H. (2012). An energy-aware distributed clustering protocol in wireless sensor networks using fuzzy logic. Ad Hoc Networks, 10(7), 1469-1481.
[22] Sharma, T., Kumar, B., Berry, K., Dhawan, A., Rathore, R. S., & Gupta, V. (2014, April). Ant based cluster head election algorithm in wireless sensor network to avoid redundancy. In Communication Systems and Network Technologies (CSNT), 2014 Fourth International Conference on (pp. 83-88). IEEE.
[23] Eschenauer, L., & Gligor, V. D. (2002, November). A key-management scheme for distributed sensor networks. In Proceedings of the 9th ACM Conference on Computer and Communications Security (pp. 41-47). ACM.
[24] Chidean, M. I., Morgado, E., del Arco, E., Ramiro-Bargueno, J., & Caamao, A. J. (2015). Scalable data-coupled clustering for large scale WSN. IEEE Transactions on Wireless Communications, 14(9), 4681-4694.

Persistent Homology and Machine Learning

Primož Škraba
Artificial Intelligence Laboratory†, Jožef Stefan Institute
E-mail: primoz.skraba@ijs.si
† ARRS Project TopRep N1-0058

Keywords: persistent homology, topological data analysis, overview

Received: March 27, 2018

In this position paper, we present a brief overview of the ways topological tools, in particular persistent homology, have been applied to machine learning and data analysis problems. We provide an introduction to the area, including an explanation as to how topology may capture higher order information. We also provide numerous references for the interested reader and conclude with some current directions of research.

Povzetek: V tem članku predstavljamo pregled topoloških orodij, predvsem vztrajno homologijo, ki je uporabna na področju strojnega učenja in za analizo podatkov. Začnemo z uvodom v področje in razložimo, kako topologija lahko zajame informacije višjega reda. Članek vsebuje tudi reference na pomembna dela za zainteresiranega bralca. Zaključimo s trenutnimi smernicami raziskav.

1 Introduction

Topology is the mathematical study of spaces via connectivity. The application of these techniques to data is aptly named topological data analysis (TDA). In this paper, we provide an overview of one such tool called persistent homology. Since these tools remain unfamiliar to most computer scientists, we provide a brief introduction before providing some insight as to why such tools are useful in a machine learning context. We provide pointers to various successful applications of these types of techniques to problems where machine learning has been and continues to be used.

We begin with a generic TDA pipeline (Figure 1). The input is a set of samples, usually but not always embedded in some metric space. Based on the metric and/or additional functions (such as density), a multiscale representation of the underlying space of the data is constructed. This goes beyond considering pairwise relations to include higher-order information. Persistent homology is then applied. This is a tool developed from algebraic topology, which summarizes the whole multiscale representation compactly in the form of a persistence diagram. This compact representation can then be used in various applications. The goal of this paper is to provide a brief overview and introduce the main components of the TDA pipeline.

Figure 1: The TDA pipeline: taking in points in some metric space, along with potentially other information, the data is turned into a compact representation called a persistence diagram. This summary can then be input into machine learning algorithms rather than the raw point cloud.

2 Simplicial complexes

Representations of the underlying space are built up from simple pieces glued together. There are many different approaches to this; however, the simplest is perhaps the simplicial complex. A simplex is the convex combination of k points. A single point contains only itself, an edge is the convex combination of two points, three points make a triangle, four points a tetrahedron and so on (see Figure 2).
More generally, a k-dimensional simplex is the convex combination of (k + 1) points. Just as an edge in a graph represents a pairwise relationship, triangles represent ternary relationships and higher dimensional simplices higher order relations. A graph is an example of a one-dimensional complex, as it represents all pairwise information; all higher order information is discarded. As we include higher dimensional simplices, we include more refined information, yielding more accurate models. Note that these models need not exist in an ambient space (i.e. they may not be embedded), but rather represent connectivity information. The geometric realization of simplicial complexes has a long history of study in combinatorics, but we do not address it here.

Figure 2: Simplices come in different dimensions. From left to right, a vertex is 0-dimensional, an edge is 1-dimensional, a triangle 2-dimensional and a tetrahedron is 3-dimensional.

There are three main obstacles to this type of modeling. The first is lack of data. While it may be counterintuitive, in the age of big data we are often still faced with a lack of data. This is due to the non-uniformity and non-homogeneity of data. It may not make sense to consider 10-way relationships if this data is only available for a small subset of the data. The second is computation. As we consider higher order relationships, there is often a combinatorial blow-up, as one must consider all k-tuples, leading to preprocessing requirements which are simply not feasible. The final obstacle is interpretability. While we can understand a simplex locally, understanding the global structure becomes increasingly challenging.

This is the starting point for the tools we discuss below. Much of the effort of machine learning on graphs goes into trying to understand the qualitative properties of an underlying graph. This is often done by computing statistical features on the graph: degree distributions, centrality measures, diameter, etc. To capture higher order structure, we require a different set of tools. First, we note that a collection of simplices fit together. Just as edges in a graph can only meet at a vertex, simplices can only be glued together along lower dimensional simplices, e.g. triangles meet along edges or at a vertex. This represents a constraint on how simple building blocks (e.g. simplices) can be glued together to form a space. While this does not seriously limit the resulting spaces which can be represented, it does give us additional structure.

The starting point for the introduction is to describe the gluing map, called the boundary operator. For each k-simplex it describes the boundary as a collection of (k − 1)-simplices. For example, the boundary of an edge consists of its two end points, and the boundary of a triangle consists of its three edges (Figure 3). This can be represented as a matrix with the columns representing k-simplices and the rows (k − 1)-simplices, which we denote ∂_k. The k-dimensional homology can be defined as

H_k = ker ∂_k / im ∂_{k+1}.

The kernel is simply the collection of k-simplices which form the nullspace of the matrix and which correspond to cycles (note that this agrees with the notion of graph-theoretic cycles). We then disregard all such cycles which bound regions filled in by higher dimensional simplices. What remains is the number of k-dimensional holes in the space. Specifically, 0-dimensional homology corresponds to the number of connected components, 1-dimensional homology to the number of holes and so forth. The k-th Betti number, β_k, is the number of independent such features. This is analogous to the rank of a matrix describing the number of basis elements a vector space has. This yields a qualitative description of the space. For a more complete introduction to homology, we recommend the book by Munkres [24] or the more advanced book by Hatcher [18]. An alternative introduction which also includes persistent homology (described in the following section) can be found in Edelsbrunner and Harer [13].

Figure 3: Simplices are glued together in a specific way, with each simplex glued to lower dimensional simplices, called its boundary. Here we show that an edge has two vertices as its boundary and a triangle has three edges as its boundary.

Our goal here is to point out the intuition behind simplicial complexes and one approach to describing them qualitatively. We note that algorithms and implementations are readily available [2, 19, 25, 23] and can often be interpreted through linear algebra.
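As a small illustration of this linear-algebra view, the sketch below computes Betti numbers over the rationals from hand-written boundary matrices using the rank-nullity relation β_k = dim ker ∂_k − rank ∂_{k+1}; it is only a toy example for a hollow triangle, the names are illustrative, and the cited implementations [2, 19, 25, 23] work over finite fields and are far more efficient.

import numpy as np

def betti_numbers(chain_dims, boundary):
    """Betti numbers over Q via beta_k = dim ker d_k - rank d_{k+1}.

    chain_dims[k] is the number of k-simplices; boundary[k] is the matrix
    of d_k : C_k -> C_{k-1} for k >= 1 (d_0 is the zero map).
    """
    max_dim = len(chain_dims) - 1
    rank = {0: 0}                                   # rank of d_0 is 0
    for k in range(1, max_dim + 1):
        rank[k] = np.linalg.matrix_rank(boundary[k])
    rank[max_dim + 1] = 0                           # no simplices above max_dim
    return [chain_dims[k] - rank[k] - rank[k + 1] for k in range(max_dim + 1)]

# Hollow triangle: 3 vertices, 3 edges (01, 02, 12), no filled 2-simplex.
d1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]])
print(betti_numbers([3, 3], {1: d1}))   # -> [1, 1]: one component, one hole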
3 Persistent homology

One problem with homology, and topological features in general, is that they are unstable. Adding a point to a space changes the number of components and the corresponding Betti number. This would make it seem as though the technique were not suitable for the study of data. A key insight from [14, 39] is that we need not look at a single space but rather at a sequence of spaces, called a filtration. This is an increasing sequence of nested spaces, which appears often when dealing with data:

∅ ⊆ X_0 ⊆ X_1 ⊆ … ⊆ X_N.

For example, a weighted graph can be filtered by the edge weights. Perhaps the most ubiquitous example is a finite metric space, where the space is a complete graph and the weights are distances. This occurs whenever the notion of a "scale" appears. Persistent homology is the study of how qualitative features evolve over parameter choices. For example, the number of components is monotonically decreasing as we connect points which are increasingly far away; this is in fact precisely single linkage clustering. Higher dimensional features such as holes can appear and disappear at different scales.

The key insight is that the evolution of features over parameter choices can be encoded compactly in the form of a barcode or persistence diagram (Figure 4). We do not go into the algebraic reasons why this exists; rather, we concentrate on its implications. An active research area has been to extend this to higher dimensional parameter spaces [6, 22, 34], but it has remained a challenging area. We refer the reader to [13] for introductions to persistent homology and its variants. For the next sections, rather than a barcode we consider a persistence diagram: each bar is mapped to a point, with the starting point of the bar as the x-coordinate and the end point as the y-coordinate.

Figure 4: Persistence in a nutshell. Given input points (left), we compute a barcode (middle), which shows how long features live. The red bars show the lifetimes of components as they merge, while the blue bars show 1-dimensional holes. We can map each bar to a point by taking the start and end as the x and y coordinates respectively, giving us the persistence diagram (right). Here we see that the big hole in the middle of the data set appears as a prominent feature (the blue dot far from the diagonal on the right).

Consider a function on a simplicial complex, f : K → R, where we define the filtration by the sublevel sets f^(−1)((−∞, α]). That is, we include all simplices with a lower function value. As we increase α, the set of simplices with a lower function value only grows, hence we only add simplices. Therefore, we obtain an increasing sequence of topological spaces, i.e. a filtration. Define X_α := f^(−1)((−∞, α]); then

X_α1 ⊆ X_α2 ⊆ … ⊆ X_αn   for   α_1 ≤ α_2 ≤ … ≤ α_n.

As another example, in a metric space we include all edges which represent a distance less than α. Now consider a perturbed metric space, giving rise to a different function g. The following theorem establishes stability: if the input (in this case, the function) does not change much, then the output does not change much.

Theorem 1 ([11]). Let K be a simplicial complex with two functions f, g : K → R. Then the persistence diagrams Dgm(f) and Dgm(g) for their sublevel set filtrations satisfy

dB(Dgm(f), Dgm(g)) ≤ ||f − g||_∞,

where Dgm(·) denotes the persistence diagram (i.e. a topological descriptor which is a set of points in R²) and dB(·) denotes the bottleneck distance. The bottleneck distance is the solution to the optimization problem which constructs a matching between the points of two diagrams minimizing the maximum distance between matched points. While it is difficult to overstate the importance of this result, it does have some drawbacks. In particular, the bound is in terms of the ∞-norm, which in the presence of outliers can be very large. Recently this result has been specialized to Wasserstein stability, which is a much stronger result (albeit in a more limited setting).

Theorem 2 ([36]). Let f, g : K → R be two functions. Then

W_p(Dgm(f), Dgm(g)) ≤ ||f − g||_p.

The Wasserstein distance is common in the machine learning and statistics literature, as it is a natural distance between probability distributions. This recent result indicates that distances between diagrams are indeed more generally stable and so suitable for applications. Stability has become an area of study in its own right, and we now have a good understanding of the types of stability we can expect. The literature is too vast to list here, so we limit ourselves to a few relevant pointers [3, 8].
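As a hands-on counterpart to the stability results above, the following sketch builds persistence diagrams for a noisy circle and for a slightly perturbed copy and compares them with the bottleneck distance. It assumes the Python packages ripser and persim from the scikit-tda ecosystem (ripser is a Python port of the C++ Ripser of [2]); the example data, names and parameter values are ours and serve only as an illustration.

import numpy as np
from ripser import ripser          # Vietoris-Rips persistent homology
from persim import bottleneck      # bottleneck distance between diagrams

rng = np.random.default_rng(1)

# Sample a noisy circle: one prominent 1-dimensional hole is expected.
theta = rng.uniform(0, 2 * np.pi, 200)
points = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))

# A small perturbation of the same point cloud.
perturbed = points + 0.02 * rng.normal(size=points.shape)

# Persistence diagrams in dimensions 0 and 1 for both clouds.
dgms_a = ripser(points, maxdim=1)['dgms']
dgms_b = ripser(perturbed, maxdim=1)['dgms']

# The most persistent 1-dimensional feature (the circle) in the first cloud.
h1 = dgms_a[1]
print("longest H1 bar:", (h1[:, 1] - h1[:, 0]).max())

# Stability in action: a small perturbation of the input gives nearby diagrams.
print("bottleneck distance (H1):", bottleneck(dgms_a[1], dgms_b[1]))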
4 Topological features

Here we describe some applications of persistence to machine learning problems. The key idea is to use persistence diagrams as feature vectors that serve as input to further machine learning algorithms. There are several obstacles to this. The most important is that the space of persistence diagrams is quite pathological. The first approach to get around this was persistence landscapes [4], which lift persistence diagrams into a Hilbert space and allow them to be fed into most standard machine learning algorithms. This has been followed up by rank functions [33], as well as several kernels [30]. More recently, there has been work on learning optimal functions of persistence diagrams using deep learning [20]. There has also been significant work on the statistical properties of persistence diagrams and landscapes [16], including bootstrapping techniques [9].

These techniques have been applied in a number of application areas. Perhaps the most extensive is geometry processing. Combined with local features such as curvature or features based on heat kernels, different geometric structures can be extracted, including symmetry [26], segmentation [35], and shape classification and retrieval [7]. Another application area where persistence diagrams have been found to be informative is biology, especially protein docking [1] and modelling pathways in the brain [17]. The final application area we mention is materials science. This is an area where machine learning has not yet been applied extensively, partially due to the fact that the input is of a significantly different flavor than what is typical in machine learning. For example, standard image processing techniques do not work well with scientific images such as electron microscope images. By using topological summaries, the relevant structure is well captured [32, 21]. This area is still in its early stages, with many more exciting developments expected.

We conclude this section by noting that persistence diagrams are not the only topological features which have been applied. Originally, the Euler curve was applied to fMRI data [38] (we note that this is where the term topological inference was first used). This feature has been extensively studied in the statistics literature, but it is provably less informative than persistence diagrams, although it is far more computationally tractable. In addition to fMRI, it has been applied to various classification problems [31].
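To give a concrete flavor of the feature-vector idea at the start of this section, the sketch below uses a deliberately crude fixed-length summary of a diagram (number of points, maximum and total persistence per dimension) rather than the landscapes or kernels cited above; it again assumes the ripser package and scikit-learn, and the helper names and toy data are ours.

import numpy as np
from ripser import ripser
from sklearn.linear_model import LogisticRegression

def diagram_features(points):
    """Crude fixed-length summary of the H0/H1 persistence diagrams."""
    dgms = ripser(points, maxdim=1)['dgms']
    feats = []
    for dgm in dgms:
        finite = np.isfinite(dgm[:, 1])
        pers = dgm[finite, 1] - dgm[finite, 0]
        feats += [len(pers), pers.max() if len(pers) else 0.0, pers.sum()]
    return np.array(feats)

rng = np.random.default_rng(2)

def noisy_circle(n=100):
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[np.cos(t), np.sin(t)] + 0.1 * rng.normal(size=(n, 2))

def noisy_blob(n=100):
    return 0.5 * rng.normal(size=(n, 2))

# Two classes that differ topologically (one hole versus none).
X = np.array([diagram_features(noisy_circle()) for _ in range(20)]
             + [diagram_features(noisy_blob()) for _ in range(20)])
y = np.array([1] * 20 + [0] * 20)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))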
5 Other applications

In addition to providing a useful summary and features for machine learning algorithms, a second direction of interest is the map back to the data. This inverse problem is very difficult and can often be impossible in general. Nonetheless, the situation is often not as hopeless as it would seem. Some of the first work in this direction re-interpreted single linkage clustering through the lens of persistence [10]. While it is well known that single linkage clusters are unstable, it is possible to use persistence to show that there exist stable parts of the clusters, and a "soft" clustering algorithm can be developed to stabilize the clusters, where each data point is assigned a probability of belonging to a given cluster. A current direction of research is to find similar stable representations in the data for higher dimensional structures (such as cycles).

A related problem is that of parameterization, that is, finding intrinsic coordinates describing the data, extending successful techniques in dimensionality reduction. These include linear methods such as PCA and MDS as well as non-linear methods such as ISOMAP and LLE. The first such work coordinatized the space of textures using a Klein bottle as the underlying model [28], a topological model found a few years prior [5]. This was, however, built by hand. The first class of general methods maps circular coordinates to data [12]. This is particularly useful when dealing with recurrence in time-varying systems, since recurrence (including periodicity) is naturally modeled by an angle. Combining persistence with least-squares optimization provides an automatic pipeline for finding such coordinates. This was applied to characterizing human motions such as different walks and other activities [37]. Further work has shown how to construct coordinate systems for higher dimensional structures based on the projective plane [27].

The final direction we consider is encoding topological constraints in machine learning algorithms. In [29], topological priors were used to aid in parameter selection. For example, the reconstruction of a racetrack should have one component and one hole (the main loop). Computing the persistence with respect to a reconstruction parameter (e.g. the bandwidth of a kernel) allows us to choose a parameter value where the reconstruction has the desired topological "shape". The encoding of topological constraints is still in the very early stages but has the potential to provide a new type of regularization to machine learning techniques.
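As a final illustration, the 0-dimensional persistence that underlies the persistence-based view of single linkage clustering mentioned in Sections 3 and 5 can be computed with a short union-find pass over the edges sorted by weight. The sketch below is written for clarity rather than efficiency, and all names in it are illustrative.

def zero_dim_barcode(num_vertices, weighted_edges):
    """0-dimensional persistence of a weighted graph filtered by edge weight.

    Every vertex is born at scale 0; when an edge merges two components,
    the younger component dies at that edge's weight (elder rule).
    Returns a list of (birth, death) pairs, with float('inf') for the
    component that never dies.
    """
    parent = list(range(num_vertices))
    birth = [0.0] * num_vertices          # all components are born at scale 0

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    bars = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                      # edge creates a 1-cycle, not a merge
        # The component with the (equal or) later birth dies at weight w.
        young, old = (ru, rv) if birth[ru] >= birth[rv] else (rv, ru)
        bars.append((birth[young], w))
        parent[young] = old
    bars.append((0.0, float('inf')))      # one component survives forever
    return bars

# Three points on a line at positions 0, 1 and 3, with pairwise distances as
# edge weights: two finite bars (deaths at 1 and 2) and one infinite bar.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0)]
print(zero_dim_barcode(3, edges))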
6 Discussion

Topological data analysis and applications of topology are still in their early stages. Various efforts to bridge the gap between algebraic topology and statistics (and probability) have made rapid progress over the last few years, culminating in a dedicated R package [15]. At the same time, increasingly efficient software for computing persistent homology exists, and it is now feasible to consider billions of points in low dimensions. This is increasingly bridging the gap between theory and practice.

The area has undergone rapid development over the last 10 years and shows no signs of slowing down. In terms of theory, the primary question driving the community is the notion of multi-dimensional or multi-parameter persistence, where the computational obstacles are much more daunting. Nonetheless, progress is being made. Success promises to further reduce the need for and dependence on parameter tuning.

The combination of deep learning techniques with topological techniques promises to provide new areas of application as well as potentially better performance. These methods are primarily complementary, allowing them to build on each other. In conclusion, while obstacles remain, the inclusion of topological techniques into the machine learning toolbox is rapidly making progress.

References

[1] Pankaj K. Agarwal, Herbert Edelsbrunner, John Harer, and Yusu Wang. Extreme elevation on a 2-manifold. Discrete & Computational Geometry, 36(4):553–572, 2006.
[2] U. Bauer. Ripser. https://github.com/Ripser/ripser, 2016.
[3] Ulrich Bauer and Michael Lesnick. Induced matchings of barcodes and the algebraic stability of persistence. In Proceedings of the Thirtieth Annual Symposium on Computational Geometry, page 355. ACM, 2014.
[4] Peter Bubenik. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015.
[5] Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International Journal of Computer Vision, 76(1):1–12, 2008.
[6] Gunnar Carlsson and Afra Zomorodian. The theory of multidimensional persistence. Discrete & Computational Geometry, 42(1):71–93, 2009.
[7] Frédéric Chazal, David Cohen-Steiner, Leonidas J. Guibas, Facundo Mémoli, and Steve Y. Oudot. Gromov-Hausdorff stable signatures for shapes using persistence. In Computer Graphics Forum, volume 28, pages 1393–1403. Wiley Online Library, 2009.
[8] Frédéric Chazal, Vin De Silva, Marc Glisse, and Steve Oudot. The structure and stability of persistence modules. arXiv preprint arXiv:1207.3674, 2012.
[9] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. On the bootstrap for persistence diagrams and landscapes. arXiv preprint arXiv:1311.0376, 2013.
[10] Frédéric Chazal, Leonidas J. Guibas, Steve Y. Oudot, and Primoz Skraba. Persistence-based clustering in Riemannian manifolds. Journal of the ACM (JACM), 60(6):41, 2013.
[11] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.
[12] Vin De Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson. Persistent cohomology and circular coordinates. Discrete & Computational Geometry, 45(4):737–759, 2011.
[13] Herbert Edelsbrunner and John Harer. Computational Topology: An Introduction. American Mathematical Society, 2010.
[14] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification.
In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 454–463. IEEE, 2000.
[15] Brittany Terese Fasy, Jisu Kim, Fabrizio Lecci, and Clément Maria. Introduction to the R package TDA. arXiv preprint arXiv:1411.1830, 2014.
[16] Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, Aarti Singh, et al. Confidence sets for persistence diagrams. The Annals of Statistics, 42(6):2301–2339, 2014.
[17] Margot Fournier, Martina Scolamiero, Mehdi Gholam-Rezaee, Hélene Moser, Carina Ferrari, Philipp S. Baumann, Vilinh Tran, Raoul Jenni, Luis Alameda, Karan Uppal, et al. M3. Topological analyses of metabolomic data to identify markers of early psychosis and disease biotypes. Schizophrenia Bulletin, 43(suppl_1):S211–S212, 2017.
[18] Allen Hatcher. Algebraic Topology. 2002.
[19] Gregory Henselman and Robert Ghrist. Matroid filtrations and computational persistent homology. arXiv preprint arXiv:1606.00199, 2016.
[20] Christoph Hofer, Roland Kwitt, Marc Niethammer, and Andreas Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pages 1633–1643, 2017.
[21] Yongjin Lee, Senja D. Barthel, Paweł Dłotko, S. Mohamad Moosavi, Kathryn Hess, and Berend Smit. Quantifying similarity of pore-geometry in nanoporous materials. Nature Communications, 8, 2017.
[22] Michael Lesnick. The theory of the interleaving distance on multidimensional persistence modules. Foundations of Computational Mathematics, 15(3):613–650, 2015.
[23] Dmitriy Morozov. Dionysus. Software available at http://www.mrzv.org/software/dionysus, 2012.
[24] James R. Munkres. Elements of Algebraic Topology, volume 4586. Addison-Wesley Longman, 1984.
[25] Vidit Nanda. Perseus: the persistent homology software. Software available at http://www.sas.upenn.edu/~vnanda/perseus, 2012.
[26] Maks Ovsjanikov, Quentin Mérigot, Viorica Pătrăucean, and Leonidas Guibas. Shape matching via quotient spaces. In Computer Graphics Forum, volume 32, pages 1–11. Wiley Online Library, 2013.
[27] Jose A. Perea. Multi-scale projective coordinates via persistent cohomology of sparse filtrations. arXiv preprint arXiv:1612.02861, 2016.
[28] Jose A. Perea and Gunnar Carlsson. A Klein-bottle-based dictionary for texture representation. International Journal of Computer Vision, 107(1):75–97, 2014.
[29] Florian T. Pokorny, Carl Henrik Ek, Hedvig Kjellström, and Danica Kragic. Topological constraints and kernel-based density estimation. Advances in Neural Information Processing Systems, 25, 2012.
[30] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4741–4748, 2015.
[31] Eitan Richardson and Michael Werman. Efficient classification using the Euler characteristic. Pattern Recognition Letters, 49:99–106, 2014.
[32] Vanessa Robins, Mohammad Saadatfar, Olaf Delgado-Friedrichs, and Adrian P. Sheppard. Percolating length scales from topological persistence analysis of micro-CT images of porous materials. Water Resources Research, 52(1):315–329, 2016.
[33] Vanessa Robins and Katharine Turner. Principal component analysis of persistent homology rank functions with case studies of spatial point patterns, sphere packing and colloids. Physica D: Nonlinear Phenomena, 334:99–117, 2016.
[34] Martina Scolamiero, Wojciech Chachólski, Anders Lundman, Ryan Ramanujam, and Sebastian Öberg. Multidimensional persistence and noise.
Foundations of Computational Mathematics, 17(6):1367–1406, 2017.
[35] Primoz Skraba, Maks Ovsjanikov, Frederic Chazal, and Leonidas Guibas. Persistence-based segmentation of deformable shapes. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 45–52. IEEE, 2010.
[36] Primoz Skraba and Katharine Turner. Wasserstein stability of persistence diagrams. Submitted to the Symposium on Computational Geometry 2018.
[37] Mikael Vejdemo-Johansson, Florian T. Pokorny, Primoz Skraba, and Danica Kragic. Cohomological learning of periodic motion. Applicable Algebra in Engineering, Communication and Computing, 26(1-2):5–26, 2015.
[38] Keith J. Worsley, Sean Marrett, Peter Neelin, Alain C. Vandal, Karl J. Friston, Alan C. Evans, et al. A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4(1):58–73, 1996.
[39] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

Research on Intelligent English Oral Training System in Mobile Network

Fen Zhu
Foreign Languages School of Luoyang Institute of Science and Technology, Luoyang, Henan, 471000, China
E-mail: fenzhu_lit@163.com

Technical Paper

Keywords: mobile network, Android system, spoken English, resonance peak, evaluation

Received: March 13, 2018

With the rapid development of mobile networks, mobile learning, as a new learning form, is gradually being accepted by people. Based on the Android mobile platform, this paper designed a spoken English training system that can be applied to mobile network equipment, covering speech recognition, pronunciation scoring and function setting. Based on the characteristics of the Android system, this paper selected the Mel cepstrum coefficients as the feature parameters for speech recognition, and introduced the dynamic time warping algorithm as the matching algorithm for the speech recognition patterns, in order to make speech recognition more suitable for mobile Internet devices. Besides, the voice formant was used as a reference for oral scores and a scoring method based on a single reference template was adopted. Finally, the spoken English training system was developed under the Eclipse integrated environment. The test results showed that the success rate of voice input was over 98%, and the accuracy rate for spoken monophthong words, diphthong words and polysyllabic words was 97.15%, 94.96% and 93.62% respectively, suggesting that the system can accurately record and score English learners' spoken English and assist English pronunciation.

Povzetek: Prispevek se ukvarja z mobilnim učenjem angleščine na sistemih z Androidom.

1 Introduction

With the deepening of economic globalization, communication between China and other countries has become increasingly frequent. Therefore English, which is the most extensively used language worldwide, has gradually become an indispensable tool in daily life and work, and many English training institutions and learning tools have emerged. However, the traditional learning mode, i.e. the face-to-face teaching mode used in training institutions, usually cannot achieve a good result in spoken English, which is due to the large difference in pronunciation between English and Chinese. People who grow up in a Chinese environment make pronunciation mistakes unconsciously when learning oral English. Moreover, English teachers who have correct pronunciation and are able to guide pronunciation are lacking in China. Time and environments for spoken English practice are also insufficient.

With the rapid development of mobile information technology, mobile network terminals such as smartphones and tablet personal computers have covered almost every aspect of our life. Smartphone based oral English training software is more convenient and practical than the traditional teaching mode and can effectively avoid its shortcomings. Mobile learning based on mobile network devices has been extensively studied. Wang et al. [1] found that a computer corpus based teaching mode was more effective than the traditional teaching mode. Alamer et al. [2] designed and developed a lightweight language learning management system based on mobile Web technology and APIs. The system aimed to allow language students to view and download learning content on their phones and complete interactive tasks designed by teachers. Milutinovic et al. [3] proposed a mobile adaptive language learning model, whose main goal was to improve the mobile language learning process using adaptive technology. The proposed model was designed to take advantage of unique opportunities to transfer learning content in real learning situations. Taking the Android smartphone as the application platform, this study aimed to build an intelligent spoken English training system that can be used on mobile network devices.

2 Mobile learning

Mobile learning [5] refers to the use of portable communication equipment and technology so that learners can choose their preferred way to study at any time and place. Compared with the time-fixed English classroom learning mode, mobile learning has extensiveness, timeliness and interactivity features, giving learners a more relaxed and pleasant learning experience. In addition, the multimedia combination of audio, text, video, image and animation makes mobile learning more vivid. Mobile language learning enables learners to have more learning options, to make full use of fragmented time, and to be efficient and flexible.

3 Intelligent spoken English training system design

3.1 Speech recognition

3.1.1 Speech signal preprocessing

(1) Speech signal digitization. A speech signal can be analyzed and processed by a computer after digital conversion. This paper uses the headset of an Android phone as the input device for the voice signal, and uses the Audio Record Wizard [6] of the Android system to collect the underlying data. According to the Nyquist frequency theorem, a sampling frequency of 7000 Hz is used to collect the speech signal.

(2) Pre-emphasis. In order to eliminate the influence of mouth and nose radiation, speech signals are usually pre-emphasized by a first-order high-pass filter [7]. Pre-emphasis improves the resolution of the high-frequency part of speech by emphasizing the high-frequency part based on the difference between signal properties and noise properties. Pre-emphasis is usually realized using a first-order FIR high-pass digital filter [16], whose transfer function is

H(z) = 1 − a·z⁻¹,     (1)

where a is the pre-emphasis coefficient and is set to 0.98 in this study. Let the speech signal at the n-th time point be x(n); then the weighted signal is

y(n) = x(n) − a·x(n − 1),     (2)

where y(n) is the speech signal after pre-emphasis and x(n − 1) is the previous input sample of the filter.
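As a small illustration of equations (1)-(2), the following lines apply the pre-emphasis filter to a raw signal with numpy; the array names are illustrative and a = 0.98 follows the value chosen above.

import numpy as np

def pre_emphasis(signal, a=0.98):
    """First-order FIR pre-emphasis: y(n) = x(n) - a * x(n - 1)."""
    signal = np.asarray(signal, dtype=float)
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(signal[0], signal[1:] - a * signal[:-1])

# Example on a short synthetic signal sampled at 7000 Hz, as in the paper.
t = np.arange(0, 0.02, 1.0 / 7000.0)
x = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 2500 * t)
y = pre_emphasis(x)                 # the high-frequency content is emphasized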
(3) Windowing
In order to ensure continuous and complete voice signals in each frame, a window function is generally applied to each frame of speech before processing [8]. This paper uses the Hamming window to window the signal:

w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, (3)

where N is the frame length.

(4) Endpoint detection
According to the characteristics of the Android platform, this paper combines the short-time average energy [9] and the short-time zero-crossing rate to detect the endpoints. Let the short-time average energy and the frame length of the n-th frame of the speech signal x_n(m) be E(n) and N respectively; then

E(n) = Σ_{m=0}^{N-1} x_n²(m). (4)

According to the size of the short-time energy, the learner's voice and noise can be distinguished: a high-energy signal is a speech signal. However, this method is less stable under low SNR conditions, so the short-time zero-crossing rate is also used. For the speech signal x_n(m), the short-time zero-crossing rate is

Z(n) = (1/2) Σ_{m=1}^{N-1} |sgn(x_n(m)) - sgn(x_n(m-1))|, (5)

where sgn(·) is the sign function. Since the energy of voiced sounds is concentrated in the low frequency band and that of voiceless sounds in the high frequency band, the zero-crossing rate of the speaker's speech is stable relative to the ambient noise, and the speech segments can be clearly identified.

3.1.2 Extraction of speech signal features

Feature extraction [17] is performed after the preprocessing to highlight the data features for pattern matching, improve the recognition rate, compress information and reduce the computation load and storage. The commonly used feature parameters include the Mel-frequency cepstral coefficient (MFCC), which has strong recognition performance and anti-noise capacity; the linear predictive coefficient, which has a small computational load but only average efficacy; and the accent sensitivity parameter, which performs well in recognizing the middle frequency band of signals. In this system, the Mel Frequency Cepstrum Coefficient (MFCC) [10] is used as the characteristic parameter for oral training. The Mel scale and the frequency have the following relationship:

Mel(f) = 2595·lg(1 + f/700), (6)

where f refers to the actual frequency of the signal. A Fourier transform [11] is performed on each preprocessed frame of the speech signal to obtain the signal spectrum. Then the squared magnitude (power spectrum) is computed, a Mel band-pass filter bank is applied, all of the filter outputs undergo a logarithm calculation, and finally the discrete cosine transform (DCT) is applied to obtain the MFCC, as in equation (7), where L refers to the number of filters, m(l) to the output of the l-th triangular filter, N to the length of each frame, and p to the order of the parameters. The process is shown in Figure 1.

Figure 1: MFCC feature extraction process.
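To make the last two stages of Figure 1 concrete, here is a small NumPy sketch of the Mel-scale mapping of equation (6) and of a DCT over log filter-bank outputs. The DCT basis used below is a common textbook variant and not necessarily the exact form of the paper's equation (7); the filter-bank values are random placeholders and all function names are illustrative.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel-scale mapping of equation (6): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mfcc_from_filterbank(filter_outputs, num_coeffs=12):
    """DCT of the log filter-bank energies (the last two boxes of Figure 1)."""
    log_m = np.log(np.asarray(filter_outputs, dtype=float))
    L = log_m.size
    p = np.arange(1, num_coeffs + 1)[:, None]        # coefficient order
    l = np.arange(1, L + 1)[None, :]                 # filter index
    basis = np.cos(np.pi * p * (l - 0.5) / L)        # DCT-II style basis
    return basis @ log_m

# Illustrative use: 24 made-up triangular-filter outputs for one frame.
rng = np.random.default_rng(0)
fake_filter_outputs = rng.uniform(0.1, 1.0, size=24)
coeffs = mfcc_from_filterbank(fake_filter_outputs)
print(hz_to_mel([300, 1000, 3500]))
print(coeffs[:4])
```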
3.2 Speech signal pattern matching

In this paper, Dynamic Time Warping (DTW) [12] is used to match the characteristics of speech signals. First, let the eigenvector sequence of the standard template be Y = {y(1), ..., y(M)}, where M is the total number of speech frames, m is the time-series index of a frame and y(m) is the eigenvector of the m-th frame. The eigenvector sequence of the speech test template is X = {x(1), ..., x(Q)}, where Q is the number of frames, n is the index of a frame in the template and x(n) is the eigenvector of the n-th frame. The similarity between the test template and the standard template is represented by a vector distance; the similarity decreases as the vector distance increases. The Euclidean distance [13] is used to represent the distance between x(n) and y(m):

d(x(n), y(m)) = sqrt( Σ_i (x_i(n) - y_i(m))² ), (8)

where x_i(n) is the i-th dimension of x(n) and y_i(m) is the i-th dimension of y(m). Dynamic time warping maps the time axis n of the speech test template onto the time axis m of the standard template so as to obtain the minimum total vector distance between the templates:

D = min_w Σ_{n=1}^{Q} d(x(n), y(w(n))), (9)

where w(·) is the warping function. Dynamic time warping generally requires finding a path through the grid points such that the accumulated distance along the path is minimized. The following constraint conditions are imposed:

Boundary condition: w(1) = 1, w(Q) = M. (10)

Continuity condition: the path may only advance through adjacent grid points, so that no frame is skipped. (11)

With the above two conditions met and the accumulated frame distance minimized, the optimal path is sought as follows: starting from (1, 1), the path is extended step by step until (N, M) is reached, yielding the optimal matching path. The template distance accumulated along the matching path is D, and the minimum matching distance is taken as the measuring criterion for the similarity between templates.

3.3 Pronunciation scoring

First, the average matching distance per frame is calculated:

d = D / N, (12)

where D refers to the total matching distance of the test template and N refers to the number of frames of the test template. By using the average frame matching distance, the effect of the speech length is eliminated. For scoring, this paper proposes a method based on a single reference template. The range of the pronunciation score is 0~100, and the score is computed according to equation (13), where d refers to the average frame matching distance, and e and f are scoring parameters obtained from the experience of spoken English teachers and the matching distances.

3.4 Scoring parameter selection

In this study, the formant was taken as the criterion to evaluate the learner's spoken pronunciation, and the quality of the learner's spoken English pronunciation was judged by the similarity between the pronunciation formants of the test model and of the standard model. A formant is a region where energy is concentrated in the speech spectrum, and it reflects the physical characteristics of the resonant cavity. In the process of producing vowels and consonants in the oral cavity, the harmonic vibration frequencies of the sound are regulated by the vocal cavity and are strengthened or attenuated irregularly; the regions with a high degree of enhancement form the formants. In the spectrum of vowels, the first three formants play a key role in the quality of the sound. The first two formants are particularly sensitive to the height of the tongue position: the higher the first formant, the lower the tongue position; the second and third formants also have a certain relationship with the tongue position, but this relationship is not particularly prominent. Therefore, the first formant is chosen as the basis for judging pronunciation quality. In this paper, the formant is extracted using the linear prediction method [14]. Regarding the vocal tract as a resonant cavity, the formant is the resonant frequency of the cavity.
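The matching and scoring steps of Sections 3.2-3.3 can be illustrated with the short NumPy sketch below: a basic dynamic-time-warping distance between two MFCC-like feature sequences followed by the average per-frame distance of equation (12). The three-step dynamic program shown here is a generic textbook variant, not necessarily the exact path constraints of this paper, and all names and the random feature frames are illustrative.

```python
import numpy as np

def dtw_distance(test: np.ndarray, ref: np.ndarray) -> float:
    """Accumulated DTW distance between test (Q x p) and reference (M x p) feature sequences."""
    Q, M = len(test), len(ref)
    # Pairwise Euclidean frame distances, as in equation (8).
    dist = np.sqrt(((test[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2))
    acc = np.full((Q, M), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Q):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,                    # vertical step
                       acc[i, j - 1] if j > 0 else np.inf,                    # horizontal step
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)      # diagonal step
            acc[i, j] = dist[i, j] + prev
    return acc[-1, -1]

# Illustrative use with random "MFCC" frames (12-dimensional).
rng = np.random.default_rng(1)
test_frames = rng.normal(size=(40, 12))
ref_frames = rng.normal(size=(50, 12))
D = dtw_distance(test_frames, ref_frames)
avg_frame_distance = D / len(test_frames)   # equation (12)
print(D, avg_frame_distance)
```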
3.5 Function and interface design

The oral English training system based on the Android smartphone platform can provide effective feedback on learners' oral English pronunciation through animation, audio, video and images. The function design of the system is as follows.

First of all, the system should have standard pronunciation audios and videos to guide learners, and introduce the key points of English pronunciation and tongue position in the form of pictures and texts. Before building the system, spoken material such as phonetic symbols, words and sentences needs to be collected, and separate folders of pictures, videos and texts should be established for system access. The AudioTrack class of the Android SDK is used for speech signal playback and the VideoView class for video playback.

Secondly, the system should be able to prompt the learner to read words and phrases, record and play back the voice signals, create a cache folder, and store the recorded voice signals in MP3 format. The AudioRecord class is used to record voice signals, with the sampling frequency set to 8000 Hz, a mono channel and 16-bit samples.

Then, the system uses speech recognition and the related algorithms to score the learners' spoken pronunciation and establishes a spoken appraisal folder. The SharedPreferences component of the Android system is used to store the learner's spoken rating results.

Finally, the system should be able to compare the learner's pronunciation with the standard pronunciation and use AChartEngine to show the learner a comparison chart of the formant changes, so that learners can find problems more intuitively. Besides, the system should give advice on spoken pronunciation based on the relationship between tongue shape, mouth shape and the signal formants.

Interface design: the main interface includes four oral training options (vowels, consonants, words and sentences), and learners can choose items according to their own preference. At the same time, the formant comparison chart and the historical scoring items are placed on the main interface for learners to view, and help and exit options are also provided. The training score interface elements include pronunciation demonstration, pronunciation following, pronunciation contrast, pronunciation evaluation, the main menu, oral demonstration (animation, audio and video, pictures and other forms) and the corresponding text descriptions.

The development of the system is mainly done in the Eclipse integrated development environment. The specific development and operating environment is: PC operating system Windows 7 (32-bit); development components: Java JDK 8.0, Eclipse [15] 4.5 (Mars), Android SDK 4.0; hardware environment: Glory Play 6X (RAM: 3 GB, ROM: 32 GB, Android 6.0); programming language: Java. Figure 2 shows the interface effect.

Figure 2: Oral training system main interface and rating interface.

4 System test results

This study invited three experienced English teachers as score judges and 10 college students as the subjects of the oral English training system. The scoring is based on tongue position, mouth shape, pronunciation completeness and clarity, with 25 points for each item. The average of the scores given by the three teachers is taken as the final score.

4.1 Speech input test

First, the subjects' speech was recorded and the recognition rate of the speech input system was tested. Following the system instructions, the subjects read after the system 20 monophthong words, 15 diphthong words and 15 polysyllabic words. The three teachers judged whether the speech input was successful; the results are shown in Table 1.
Word type      Monophthong   Diphthong   Polysyllabic
Total number   20            15          15
Accuracy       100%          100%        96%

Table 1: Speech input success rate of the system.

4.2 Scoring accuracy test

Based on the speech input, scores were given by the system and by the three teachers respectively. For the i-th word, the similarity between the system score and the teachers' score was calculated, and the scoring accuracy over the n samples was then obtained from these similarities, as shown in Table 2.

Word type      Monophthong   Diphthong   Polysyllabic
Total number   20            15          15
Accuracy       97.15%        94.96%      93.62%

Table 2: System test accuracy results.

As shown in Table 2, the accuracy on monophthong word pronunciation reached 97.15%, and that on diphthong words and polysyllabic words reached 94.96% and 93.62% respectively. As the number of vowels in the words increased, the pronunciation became more complicated, which affected the scoring; still, the scoring accuracy stayed above 90% on average.

5 Conclusion

With the rapid development of mobile networks and the upgrading of mobile network equipment, the concept of mobile learning has gradually been integrated into our life. This paper focused on mobile learning and designed an oral English training system that can be used on Android smartphones. First, we designed the speech signal preprocessing, feature extraction and pattern matching of the system's speech recognition. In line with the characteristics of the Android system, the dynamic time warping algorithm, which requires only a small amount of computation, was introduced as the pattern matching algorithm for the speech signals. Then, according to the pronunciation characteristics of spoken English, we selected the pronunciation formant as the reference for system scoring and adopted the single reference template as the scoring method. Afterwards, the system functions were designed from the three aspects of pronunciation demonstration, pronunciation imitation and pronunciation evaluation. Finally, the system was tested; the results showed that the system had a high success rate in recognizing spoken word pronunciation and a high accuracy in spoken English scoring. In general, the designed system initially met the requirements of speech input accuracy and scoring accuracy for mobile spoken English training and provided some points that need attention in pronunciation, which is helpful for spoken English training.

6 References

[1] An L L, Wu Y N, Liu Z, Liu R S (2012). An Application of Mispronunciation Detecting Network for Computer Assisted Language Learning System. Journal of Electronics & Information Technology, 34(9), pp. 2085-2090.
[2] Alamer R A, Al-Otaibi H M, Al-Khalifa H S (2015). L3MS: A Lightweight Language Learning Management System Using Mobile Web Technologies. 2015 IEEE 15th International Conference on Advanced Learning Technologies (ICALT), IEEE, Hualien, Taiwan, pp. 326-327.
[3] Milutinovic M, Bojovic Z, Labus A, Bogdanovic B, Despotovic-Zrakic M (2016). Ontology-based generated learning objects for mobile language learning. Computer Science & Information Systems, pp. 4-4.
[4] Troussas C, Virvou M, Alepis E (2014). Multifactorial user models for personalized mobile-assisted language learning. Frontiers in Artificial Intelligence & Applications, 262, pp. 275-282.
[5] Sharples M, Arnedillosánchez I, Milrad M, Vavoula G (2014). Mobile Learning. R Keith Sawyer, pp. 501-521.
[6] Hu Y, Azim T, Neamtiu I (2015).
Versatile yet lightweight record-and-replay for Android. ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM, New York, USA, pp. 349-366.
[7] Deepa D, Shanmugam A (2011). Enhancement of noisy speech signal based on variance and modified gain function with PDE preprocessing technique for digital hearing aid. Journal of Scientific & Industrial Research, 70(5), pp. 332-337.
[8] Takagi T, Seiyama N, Miyasaka E (2015). A method for pitch extraction of speech signals using autocorrelation functions through multiple window lengths. Electronics & Communications in Japan, 83(2), pp. 67-79.
[9] Sahoo T R, Patra S (2014). Silence Removal and Endpoint Detection of Speech Signal for Text Independent Speaker Identification. International Journal of Image, Graphics & Signal Processing, 6(6), pp. 27-35.
[10] Valentini-Botinhao C, Yamagishi J, King S (2012). Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. Proc. Interspeech, pp. 631-634.
[11] Rathore P S, Boyat A, Joshi B K (2013). Speech signal analysis using Fourier-Bessel Expansion and Hilbert Transform Separation Algorithm. IEEE International Conference on Signal Processing, Computing and Control, IEEE, Solan, India, pp. 1-4.
[12] Dhingra S, Nijhawan G, Pandit P (2013). Isolated speech recognition using MFCC and DTW. International Journal of Advanced Research in Electrical, Electronics & Instrumentation Engineering, 2(8), pp. 4085-4092.
[13] Lang F Y, Li X G (2012). Multi-Sensors Information Fusion Based on Moments Method and Euclid Distance. Advanced Materials Research, 383-390, pp. 5447-5452.
[14] Yusnita M A, Paulraj M P, Yaacob S, Bakar S A, Saidatul A (2011). Malaysian English accents identification using LPC and formant analysis. IEEE International Conference on Control System, Computing and Engineering, IEEE, Penang, Malaysia, pp. 472-476.
[15] Wang L, Groves P, Ziebart M (2013). Urban Positioning on a Smartphone: Real-time Shadow Matching Using GNSS and 3D City Models. Proceedings of the 26th International Technical Meeting of The Satellite Division of the Institute of Navigation, Nashville Convention Center, pp. 1606-1619.
[16] Thakral S, Goswami D, Sharma R, Prasanna C K, Joshi A M (2016). Design and implementation of a high speed digital FIR filter using unfolding. IEEE Power India International Conference, pp. 1-4.
[17] Han Z, Wang J (2016). Dynamic feature extraction for speech signal based on MUSIC. Control and Decision Conference, pp. 3770-3773.

Probability Matrix Decomposition Based Collaborative Filtering Recommendation Algorithm

Yili Tan and Huijuan Zhao
Department of Statistics, College of Science, North China University of Science and Technology, Hebei, 063210, China
E-mail: tangyili_tyl@163.com

Yourong Wang
Department of Basic, Tangshan College, Hebei, 063000, China

Min Qiu
Public Mathematics Department, Hubei University of Automotive Technology, Hubei, 442002, China

Technical Paper
Keywords: probability matrix decomposition, collaborative filtering, recommendation algorithm

Received: April 8, 2018

With the development of society, an increasing amount of information has appeared on the Internet. It includes almost all the content we need, but information overload makes it hard for people to find the information they actually need. Collaborative filtering recommendation algorithms can recommend items to users according to their demands.
However, the traditional recommendation algorithms, which have defects such as data sparsity, need to be improved. In this study, the collaborative filtering recommendation algorithm was analyzed, an improved collaborative filtering recommendation algorithm based on probability matrix decomposition was put forward, and the feasibility of the algorithm was verified. Moreover, the traditional algorithms, including the user based collaborative filtering algorithm, the item based collaborative filtering algorithm, the singular value decomposition based collaborative filtering algorithm and the basic matrix based collaborative filtering algorithm, were tested. The test results demonstrated that the proposed algorithm had a higher accuracy than the traditional algorithms, and that its mean absolute error and root-mean-square error were significantly smaller than those of the traditional algorithms. Therefore it can be applied in daily life.

Povzetek: V sestavku je predstavljena dekompozicija verjetnostne matrike s priporočilnim algoritmom na osnovi skupinskega filtriranja.

1 Introduction

The increased amount of information which appeared due to the development of Internet technology increases the difficulty of finding the target information. Therefore, many recommendation algorithms have been proposed. Such recommendation algorithms can filter information according to personal preference; hence they have been universally applied in fields such as web browsing, film recommendation and e-commerce [1]. Li [2] analyzed the sales records in a tea leaves sales system by combining the Hadoop distributed system with the traditional collaborative filtering algorithm to obtain recommendation rules which could satisfy the preferences of customers and help users find the tea leaves they needed. Yu et al. [3] proposed a weighted cloud model attribute based service clustering algorithm and calculated the user score similarity using the weighted Pearson correlation coefficient method of the service clustering algorithm and the user service selection index weights. They found that the algorithm could accurately calculate service recommendation credibility, satisfy the demands of users on service credibility, and enhance the success rate of user service selection. The collaborative filtering algorithm has a high degree of individualization and automation, but it exhibits a few problems such as sparsity and limited system extensibility. Therefore, in this study a probability matrix decomposition based collaborative filtering algorithm was put forward to correct the defects of the traditional collaborative filtering algorithm, and simulation experiments were performed. The experimental results suggested that the mean absolute error (MAE), the root-mean-square error (RMSE) and the accuracy of the algorithm could reach the expected levels. This work provides a reference for the application of the probability matrix decomposition based collaborative filtering recommendation algorithm in searching for Internet information.

2 Collaborative filtering recommendation algorithm

2.1 Collaborative filtering algorithms based on different elements

2.1.1 Collaborative filtering algorithm based on users

The user based collaborative filtering algorithm focuses on users. It makes recommendations using the user-item score matrix: it first searches for users who are similar to the target user and then recommends the selections of the found users to the target user. The algorithm has two functions, i.e. one for calculating the similarity between adjacent users to establish the matrix and one for recommending to the target users using the algorithm evaluation method.
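As a generic illustration of the user based approach just described (not the authors' implementation, which is not detailed here), the sketch below computes cosine similarities between the rows of a small user-item score matrix and predicts a missing score from the most similar neighbours; the matrix values and function name are made up.

```python
import numpy as np

def user_based_predict(R: np.ndarray, user: int, item: int, k: int = 2) -> float:
    """Predict R[user, item] from the k most similar users who rated the item.
    R is a user-item score matrix with 0 meaning 'not rated'."""
    rated = R[:, item] > 0
    rated[user] = False
    # Cosine similarity between the target user and every other user.
    norms = np.linalg.norm(R, axis=1) * np.linalg.norm(R[user]) + 1e-12
    sims = (R @ R[user]) / norms
    candidates = np.where(rated)[0]
    if candidates.size == 0:
        return R[R > 0].mean()                       # fall back to the global mean
    top = candidates[np.argsort(sims[candidates])[::-1][:k]]
    weights = sims[top]
    return float(weights @ R[top, item] / (weights.sum() + 1e-12))

# Tiny made-up score matrix: 4 users x 5 items, 0 = unrated.
R = np.array([[5, 3, 0, 1, 4],
              [4, 0, 0, 1, 5],
              [1, 1, 0, 5, 0],
              [0, 1, 5, 4, 0]], dtype=float)
print(user_based_predict(R, user=0, item=2))
```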
2.1.2 Collaborative filtering algorithm based on items

The item based collaborative filtering algorithm provides recommendations to users based on evaluation data after establishing a user-item evaluation data model. In detail, it calculates the similarity between different items to determine the preference of the target user and then recommends similar items to the target user. The algorithm has functions for calculating the similarity between items, establishing the similarity matrix and recommending to target users by scoring similar items with the algorithm evaluation method.

2.2 Collaborative recommendation algorithm based on probability matrix decomposition

Probability matrix decomposition maps the information of users and items to a low-dimensional characteristic space in a probabilistic way and then analyzes the users' interest in items using linear combinations of the low-dimensional vectors [4]. The user-item score matrix is expressed as F. A matrix M with zero mean and a given variance and a random number matrix N with zero mean and a given variance were produced by MATLAB [5], in which a refers to the dimension of the decomposition, M refers to the a-dimensional characteristic matrix of users, and N refers to the a-dimensional characteristic matrix of items; the corresponding columns of M and N are the latent characteristic vectors of individual users and items. The matrices were obtained through machine training. Supposing that the mean value of the error between the actual score and the predicted score is 0 and that the entries of M and N follow Gaussian distributions, the conditional distribution of the score matrix F is given in equation (1), where the indicator function equals 1 if user m has scored item n and 0 otherwise. As M and N are independent of each other, their mean values are 0 and their entries have Gaussian prior distributions, given in equations (2) and (3), where Q stands for probability. The joint probability distribution of M and N can then be obtained from equations (1), (2) and (3), as in equation (4), and taking the logarithm of this probability distribution gives equation (5). Maximizing equation (5) is equivalent to minimizing an error function containing regularization parameters [6], given in equation (6); the resulting target function is equation (7), from which the relationship between the regularization parameters can be obtained. The algorithm minimizes this function with the stochastic gradient descent method [7]: the descent direction is obtained from the derivatives, and the variables are updated repeatedly along this direction until the minimum point is reached. In each iteration, the user and item characteristic vectors are updated according to equations (8), (9) and (10), where the step size is the learning rate of the stochastic gradient descent.
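Because the concrete equations above were lost in typesetting, the following NumPy sketch only shows the general form of such a probability-matrix-factorization model trained by stochastic gradient descent: squared prediction error plus L2 regularization of the latent user and item vectors, with updates scaled by a learning rate. The dimension, learning rate and regularization values are arbitrary placeholders, not the paper's settings.

```python
import numpy as np

def train_pmf(scores, n_users, n_items, dim=10, lr=0.02, reg=0.1, epochs=50, seed=0):
    """scores: list of (user, item, rating) triples. Returns latent matrices M (users), N (items)."""
    rng = np.random.default_rng(seed)
    scores = list(scores)                            # local copy so shuffling is harmless
    M = 0.1 * rng.standard_normal((n_users, dim))    # user latent vectors
    N = 0.1 * rng.standard_normal((n_items, dim))    # item latent vectors
    for _ in range(epochs):
        rng.shuffle(scores)
        for u, i, r in scores:
            err = r - M[u] @ N[i]                    # prediction error for this observed score
            mu = M[u].copy()
            # Gradient steps on the regularized squared error.
            M[u] += lr * (err * N[i] - reg * M[u])
            N[i] += lr * (err * mu - reg * N[i])
    return M, N

# Tiny illustrative data set: (user, item, rating) triples.
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
M, N = train_pmf(data, n_users=3, n_items=3)
print(np.round(M @ N.T, 2))    # reconstructed score matrix
```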
3 Experiment

3.1 Experimental data

A 100k data set of movie ratings provided by the GroupLens project team at the University of Minnesota was used in the experiment and denoted as data set A. Data set A includes 100,000 scores for 1,682 movie items given by 943 users; each user scored at least 20 movie items. Each score is an integer between 0 and 5, and the more the user liked the movie, the higher the score. The sparseness of data set A, i.e. the percentage of movie items not scored by the users, is 1 - 100,000/(943 × 1,682) ≈ 0.937. The 100,000 scores in data set A were randomly divided into two disjoint sets: the training set S1 containing 80% of the data and the testing set S2 containing 20% of the data. Data set A was divided 10 times to perform cross validation of the algorithm. To enhance the recommendation efficiency of the algorithm, a batch processing module was added: the 100,000 scores were divided into 10 batches and 10,000 scores were processed at a time. In this way, the computational load of the system and the convergence instability of the model produced in the calculation could be reduced. The collaborative filtering recommendation algorithm based on probability matrix decomposition proceeds as follows.

Input: training set and testing set
Output: predicted scores and root-mean-square error
Set the parameters, such as the regularization parameters, and the numbers of movies and users.
While the iteration epoch < the maximum epoch:
    Divide the 100,000 scores into 10 groups of 10,000 each for separate processing.
    While the batch index is lower than 10:
        Calculate the loss function q and perform the matrix update.
    End
End
Revise the predicted scores in the testing set to positive integers by rounding, and calculate the root-mean-square error.

3.2 Scoring criteria

3.2.1 MAE

The MAE measure calculates the average of the absolute differences between the predicted scores and the real scores [8]; hence it can be used to detect the average difference between a predicted score and a real score. The smaller the value of MAE, the more accurate the algorithm:

MAE = (1/c) Σ_{e∈M} Σ_{k∈N} |p_ek - r_ek|, (11)

where p_ek stands for the predicted score of user e on item k, r_ek stands for the real score of user e on item k, the sets M and N stand for the sets of users and items in the testing set, and c stands for the number of predicted (equivalently, real) scores.

3.2.2 Root-mean-square error

The root-mean-square error is based on the average of the squared errors between the two scores. The smaller the root-mean-square error, the more accurate the prediction [9]:

RMSE = sqrt( (1/c) Σ_{e∈M} Σ_{k∈N} (p_ek - r_ek)² ), (12)

where p_ek stands for the predicted score of user e on item k, r_ek stands for the real score of user e on item k, the sets M and N stand for the sets of users and items in the testing set, and c stands for the number of predicted (equivalently, real) scores.

3.2.3 Accuracy

Accuracy can be expressed as

Accuracy = |X| / |D|, (13)

where X is the set of predicted scores which are equal to the real scores in the testing set, D is the set of predicted scores, and R is the set of real scores. Both the probability that an item is predicted correctly and the prediction accuracy can be used when recommending items to users.
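A small sketch of these three criteria follows, under the assumption (consistent with the rounding step in the batch procedure and with equation (13) as reconstructed above) that accuracy counts rounded predictions that exactly match the real scores; the score arrays are illustrative.

```python
import numpy as np

def mae(pred: np.ndarray, real: np.ndarray) -> float:
    """Equation (11): mean absolute difference between predicted and real scores."""
    return float(np.mean(np.abs(pred - real)))

def rmse(pred: np.ndarray, real: np.ndarray) -> float:
    """Equation (12): square root of the mean squared difference."""
    return float(np.sqrt(np.mean((pred - real) ** 2)))

def accuracy(pred: np.ndarray, real: np.ndarray) -> float:
    """Equation (13): share of predictions that, after rounding to integers, equal the real scores."""
    rounded = np.clip(np.rint(pred), 0, 5)      # scores are integers between 0 and 5
    return float(np.mean(rounded == real))

# Illustrative predicted and real scores for a few test ratings.
predicted = np.array([3.7, 2.1, 4.9, 1.4, 3.0])
real = np.array([4, 2, 5, 1, 4], dtype=float)
print(mae(predicted, real), rmse(predicted, real), accuracy(predicted, real))
```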
3.3 Design of experiment

The content of the experiment was as follows. To analyze the application performance of the probability matrix decomposition based collaborative filtering algorithm, the movie rating data set described above was used, and the user based collaborative filtering algorithm, the item based collaborative filtering algorithm and the probability matrix decomposition based collaborative filtering algorithm were compared. To analyze the performance further, two more algorithms, i.e. the basic matrix based collaborative filtering algorithm and the singular value decomposition based collaborative filtering algorithm, were also tested. The parameter settings of the algorithms are shown in Table 1. The user based and the item based collaborative filtering algorithms were tested six times each, corresponding to the six characteristic factor numbers (dimension k) of the probability matrix decomposition collaborative filtering algorithm and the singular value decomposition based collaborative filtering algorithm. The basic matrix collaborative filtering algorithm and the singular value decomposition based collaborative filtering algorithm used in the performance comparison were developed by referring to the relevant literature and adapted to the data to be tested. The characteristic factor number of the singular value decomposition based collaborative filtering algorithm was fixed to 6. The characteristic factor number of the basic matrix collaborative filtering algorithm was set in the same way as for the probability matrix decomposition collaborative filtering algorithm, i.e. in [10, 60] with a unit step of 10.

Table 1: The parameter setting of the five algorithms (neighbourhood or model; characteristic factor number (dimension k); learning rate; "\" means not applicable).
User based collaborative filtering algorithm: neighbourhood; \; \
Item based collaborative filtering algorithm: neighbourhood; \; \
Probability matrix decomposition collaborative filtering algorithm: probability matrix decomposition collaborative filtering model; [10, 60]; 0.02
Basic matrix collaborative filtering algorithm: basic matrix collaborative filtering model; [10, 60]; 0.02
Singular value decomposition based collaborative filtering algorithm: singular value decomposition based collaborative filtering model; 6; \

4 Experimental results and analysis

4.1 MAE and RMSE

It can be noted from Figures 1 and 2 that the predictive recommendation performance of the user based collaborative filtering algorithm was the poorest, and that the prediction performance of the singular value decomposition based collaborative filtering algorithm was not affected by the characteristic factor number but was poorer than that of the user based collaborative filtering algorithm. The performance of the user based collaborative filtering algorithm was far worse than that of the probability matrix decomposition collaborative filtering algorithm and the basic matrix based collaborative filtering algorithm. The performance of the probability matrix decomposition collaborative filtering algorithm and the basic matrix based collaborative filtering algorithm was similar, but the probability matrix decomposition collaborative filtering algorithm was still superior. The reason for the significant difference between the performance of the user based and the item based collaborative filtering algorithms is that the score given by a scorer is probably affected by the views of other scorers who have the same interests.
The reason why the performance of the singular value decomposition based collaborative filtering algorithm in prediction and recommendation was significantly poorer than that of the basic matrix collaborative filtering algorithm and the probability matrix decomposition collaborative filtering algorithm is that the singular value decomposition based algorithm is actually an improved version of the item based collaborative filtering algorithm and therefore has similar shortcomings: the matrix obtained after the preprocessing is distorted compared to the original matrix, which can affect the accuracy and the similarity of the score prediction. However, the performance of the singular value decomposition based collaborative filtering algorithm was better than that of the item based collaborative filtering algorithm, indicating its improved accuracy.

It can be noted from Figure 1 that the values of the root-mean-square error (RMSE) corresponding to the basic matrix based collaborative filtering algorithm and the probability matrix decomposition collaborative filtering algorithm gradually decreased as the characteristic factor number increased; the larger the characteristic factor number, the smaller the decrease. When the characteristic factor number was 50, the RMSE was at its minimum and the prediction accuracy was the highest. When the characteristic factor number was between 10 and 20, the decrease of the RMSE of the basic matrix based and the probability matrix decomposition collaborative filtering algorithms was large, around 1.14% and 0.700% respectively. The RMSE values of the two algorithms were only weakly sensitive to the characteristic factor number, especially for the probability matrix decomposition collaborative filtering algorithm: when the characteristic factor number was larger than 40, the fluctuation of the RMSE was quite small.

Similar to Figure 1, the MAE corresponding to the basic matrix based collaborative filtering algorithm and the probability matrix decomposition based collaborative filtering algorithm also decreased as the characteristic factor number increased and reached its minimum values, 0.675 and 0.666 respectively, when the characteristic factor number was 60. Moreover, when the MAE was taken as the evaluation index, the curves of the basic matrix collaborative filtering algorithm and the probability matrix decomposition based collaborative filtering algorithm nearly coincided, and their prediction performance was also close. Since the RMSE is more sensitive to measurement errors, the probability matrix decomposition based collaborative filtering algorithm had an advantage over the basic matrix collaborative filtering algorithm due to the addition of the regularization term.

Figure 1: Variation of the RMSE of the five algorithms with the increase of the characteristic factor number.

Figure 2: Variation of the MAE of the five algorithms with the increase of the characteristic factor number.
4.2 Accuracy

The comparison of the RMSE and the MAE between the five algorithms showed that the prediction performance of the user based (CF-User), the item based and the singular value decomposition based collaborative filtering algorithms differed significantly from that of the basic matrix based and the probability matrix decomposition based collaborative filtering algorithms. Therefore, only the accuracy of the basic matrix based collaborative filtering algorithm and the probability matrix decomposition based collaborative filtering algorithm was considered further. The results are shown in Table 2.

Table 2: The accuracy of the probability matrix decomposition based (PMF) and the basic matrix based (BMF) collaborative filtering algorithms under different characteristic factor numbers.
Characteristic factor number   PMF       BMF
10                             0.40321   0.34666
20                             0.41321   0.36022
30                             0.41403   0.36142
40                             0.41300   0.36134
50                             0.41299   0.36132
60                             0.41298   0.36132

Figure 3: The accuracy of the basic matrix based collaborative filtering algorithm and the probability matrix decomposition based collaborative filtering algorithm under different characteristic factor numbers.

It can be noted from Figure 3 that the tendency of the accuracy of the basic matrix collaborative filtering algorithm and the probability matrix decomposition based collaborative filtering algorithm was opposite to the tendencies of the MAE and the RMSE. When the characteristic factor number was small, the accuracy was low; the accuracy first increased and then stabilized with the increase of the characteristic factor number, showing almost no fluctuation when the characteristic factor number was larger than 30. This is because the effective information increased with the characteristic factor number. The accuracy of the probability matrix decomposition based collaborative filtering algorithm was much higher than that of the basic matrix based collaborative filtering algorithm. Therefore the proposed algorithm can improve the searching speed and precision.

Figure 4: Comparison of the accuracy of the probability matrix decomposition based collaborative filtering algorithm (PMF) and the basic matrix based collaborative filtering algorithm (BMF).

A recommendation system that can filter diversified data is an effective filtering approach [10]. It can recommend individual information to users according to the users' requirements; therefore it is convenient for information collection and has been extensively applied on the Internet. Wei et al. [11] put forward a collaborative filtering recommendation algorithm based on item category similarity and an interestingness measure, which could recommend information to users by calculating item categories and interestingness and had high prediction precision. In a study by Chen et al. [12], a hybrid recommendation system was put forward to help users search for learning projects. In the test, the algorithm effectively collected information, suggesting a favorable performance.

5 Conclusion

In conclusion, the probability matrix decomposition based collaborative recommendation algorithm was put forward in this study and developed for data searching and recommendation. The MAE, the RMSE and the accuracy of the algorithm were tested. Moreover, the MAE values, the RMSE values and the accuracy of the user based (CF-User), the item based, the singular value decomposition based and the basic matrix based collaborative filtering algorithms were compared.
The testing results suggested that the improved collaborative recommendation algorithm had the highest precision and accuracy, and that the precision and the accuracy became largest and stable when the characteristic factor number was more than 40. Therefore it can be applied in computer searching systems. This work provides a reference for the progress of collaborative recommendation algorithms.

6 Acknowledgement

This study was supported by the Fund of Natural Science of Hebei (A2015209040).

7 References

[1] Wang W, Wu Y H, Wu Y Y (2016). A Multi-stage Heuristic Algorithm for Matching Problem in the Modified Miniload Automated Storage and Retrieval System of E-commerce. Chinese Journal of Mechanical Engineering, 29(3):1-8.
[2] Li L (2017). Analytical Application of Hadoop-Based Collaborative Filtering Recommended Algorithm in Tea Sales System. International Conference on Applications and Techniques in Cyber Security and Intelligence, Edizioni della Normale, Cham, pp. 434-441.
[3] Yu Z Y, Wang J D, Zhang H W, Niu K (2016). Services recommended trust algorithm based on cloud model attributes weighted clustering. Automatic Control & Computer Sciences, 50(4):260-270.
[4] Zhang F, Yang J, Tai Y, Tang J (2015). Double Nuclear Norm Based Matrix Decomposition For Occluded Image Recovery and Background Modeling. IEEE Transactions on Image Processing, 24(6):1956-1966.
[5] Simpson R, Devenyi G A, Jezzard P, Hennessy T J, Near J (2017). Advanced processing and simulation of MRS data using the FID appliance (FID-A), an open source, MATLAB-based toolkit. Magnetic Resonance in Medicine, 77(1):e56841.
[6] Ando T, Bai J (2015). Selecting the Regularization Parameters in High-dimensional Panel Data Models: Consistency and Efficiency. Econometric Reviews, 151014164537003.
[7] Wu K, Sun Y, Huai Y, Jia S Q, Chen X, Jin Y Q (2015). Multi-perturbation stochastic parallel gradient descent method for wavefront correction. Optics Express, 23(3):2933-2944.
[8] Franses P H (2015). A note on the Mean Absolute Scaled Error. International Journal of Forecasting, 32(1):20-22.
[9] Nuutinen M, Virtanen T, Häkkinen J (2016). Performance measure of image and video quality assessment algorithms: subjective root-mean-square error. Journal of Electronic Imaging, 25(2):023012.
[10] Knobelsdorff P M G V, Reiferscheid F, Straakholder T M, Wirtz M S (2004). Wird der Algorithmus des European Resuscitation Council zur kardiopulmonalen Reanimation eingehalten? Intensivmedizin und Notfallmedizin, 41(1):22-28.
[11] Wei S, Ye N, Zhang S, Huang X, Zhu J (2012). Item-Based Collaborative Filtering Recommendation Algorithm Combining Item Category with Interestingness Measure. International Conference on Computer Science and Service System, IEEE Computer Society, pp. 2038-2041.
[12] Chen W, Niu Z, Zhao X, Li Y (2014). A hybrid recommendation algorithm adapted in e-learning environments. World Wide Web: Internet & Web Information Systems, 17(2):271-284.
The Decision Model for the Optimal Configuration Management of Knowledge Employees in Enterprises

Chunjiang Yang
Yantai Nanshan University, Longkou, Shandong, China
E-mail: yangyujiang_123@sina.com

Yujiang Yang
School of Mining Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China

Xiuli Tan
Yantai Nanshan University, Longkou, Shandong, China

Technical Paper
Keywords: knowledge employee, bacterial foraging algorithm, employee dispatch

Received: April 3, 2018

With the development of the knowledge economy, the role of knowledge employees in enterprises is becoming more and more important, and employees with rich knowledge represent the tower of strength in the development of enterprises. Knowledge employees, with their strong autonomy and creativity, have a stronger self-awareness in daily work. Thus, how to dispatch knowledge employees and assign them to proper positions is of utmost importance. To solve the problem of employee dispatch, the bacterial foraging algorithm was applied in this study to optimize the configuration management of knowledge employees in enterprises: a mathematical model for the optimal configuration of knowledge employees was established, and a simulation experiment was carried out on the model. It was found that the algorithm can effectively optimize the configuration of knowledge employees and is a feasible method for employee dispatch in enterprises.

Povzetek: Predstavljena je metoda za organiziranje zaposlenih v družbi znanja.

1 Introduction

Since the beginning of the 21st century, the information age has started and the knowledge economy has developed rapidly. Producing products with cheap labour is no longer able to ensure the survival and development of enterprises; therefore, enterprises should improve their competitiveness by taking advantage of knowledge employees. Knowledge employees are the most active cells in enterprises, and their task assignment and scheduling directly determine the coordination between user demand, employee personality and enterprise benefit, the utilization of knowledge resources, and the competitiveness and value of enterprises. Bogdanowicz et al. [13] considered that managing knowledge brings challenges to the field of human resource development, especially when employees are more concerned about their employability, and that an enterprise which places great importance on knowledge should pay attention to knowledge employees. But knowledge employees are at high risk of leaving enterprises because of their autonomy and creativity, so how to optimize the configuration of knowledge employees is an urgent problem. A project may not be completed in time if decisions are made considering only the task assignment and scheduling of the employees involved in that project, regardless of the integrity of multiple projects. Through discussing the concept, connotation and composition of knowledge cooperative capability, Cao [1] summarized the influence factors and established an evaluation indicator system for the cooperative capability of knowledge employees in enterprises based on collaboration, by introducing the concept of the cooperation process, in order to improve the cooperative ability of knowledge employees and ensure the effective development of enterprises.
Liu [2] introduced the loyalty of knowledge employees in state-owned coal enterprises, analysed the main reasons for the decrease of loyalty in detail, and finally summarized strategies for improving loyalty, which is beneficial to the correct configuration of human resources and the improvement of enterprises. Based on the study of the optimal configuration of knowledge employees, this study put forward a working arrangement and dispatch scheme for knowledge employees based on the bacterial foraging algorithm, in order to highlight their advantages and satisfy the demands of clients efficiently. The bacterial foraging algorithm was used to establish a mathematical model for the optimal configuration of knowledge employees, and the feasibility of the algorithm was then verified by a simulation experiment.

2 Configuration of knowledge employees in enterprises

2.1 Knowledge employee

The term "knowledge employee" refers to a person who works by applying symbols and concepts and using knowledge or information [3]. Knowledge employees, as the carriers of knowledge, play a vital role in the competition between enterprises, in the innovation and utilization of knowledge and in the reasonable configuration of resources [4]. Compared to non-knowledge employees, knowledge employees have the following characteristics. (1) They have specialized knowledge, a good educational background and high individual quality. (2) They have a strong wish to realize their self-worth, set high requirements for themselves and are interested in challenging work. (3) With strong creativity and autonomy, they can independently fulfil most creative tasks and produce new knowledge achievements based on their own knowledge [5]. (4) Their working process is difficult to monitor; the work form of knowledge employees, with its high randomness and subjectivity, differs from that of employees in traditional workshops and offices. (5) Their work results cannot be measured directly; the labour achievements created by knowledge employees exist in the form of inventions which cannot be directly measured in ordinary economic terms [6].

2.2 Configuration management of knowledge employees

Competition between enterprises and knowledge creation in enterprises are usually driven by knowledge employees. The configuration management of knowledge employees refers to arranging and dispatching them so that tasks are fulfilled in the shortest time.

3 The decision model for optimal configuration

3.1 Bacterial foraging algorithm

The bacterial foraging algorithm is a swarm intelligence optimization algorithm which achieves optimization through chemotaxis [7], reproduction and migration behaviours [8] according to the basic rules of the growth and evolution of bacteria [16]. Firstly, the individual evolutionary mechanism is formulated according to the reproduction rules of bacteria. Then the motion patterns of the individuals in the algorithm are established according to the characteristics of bacterial foraging. Finally, the information sharing system is established according to the way bacteria exchange information in colonies.
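The following compact sketch illustrates this chemotaxis-reproduction-migration loop on a simple continuous test function (the step-by-step flow used in this paper is given in Section 3.2 below). The population size, step length, loop counts and search range are arbitrary placeholder values, not the parameters of this study.

```python
import numpy as np

def bfo_minimize(f, dim=2, n_bacteria=20, n_chem=30, n_swim=4,
                 n_repro=4, n_migr=2, step=0.1, p_migr=0.25, seed=0):
    """Minimal bacterial foraging optimization of f over [-5, 5]^dim (n_bacteria must be even)."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(n_bacteria, dim))
    best_x, best_f = None, np.inf
    for _ in range(n_migr):                         # migration (elimination-dispersal) events
        for _ in range(n_repro):                    # reproduction steps
            health = np.zeros(n_bacteria)           # accumulated cost per bacterium
            for _ in range(n_chem):                 # chemotaxis steps
                for i in range(n_bacteria):
                    cost = f(pop[i])
                    direction = rng.normal(size=dim)
                    direction /= np.linalg.norm(direction)
                    # Tumble, then keep swimming while the cost improves.
                    for _ in range(n_swim):
                        trial = pop[i] + step * direction
                        trial_cost = f(trial)
                        if trial_cost < cost:
                            pop[i], cost = trial, trial_cost
                        else:
                            break
                    health[i] += cost
                    if cost < best_f:
                        best_x, best_f = pop[i].copy(), cost
            # Reproduction: the healthier half splits, the other half is discarded.
            order = np.argsort(health)
            survivors = pop[order[: n_bacteria // 2]]
            pop = np.vstack([survivors, survivors.copy()])
        # Migration: relocate a random subset of bacteria.
        relocate = rng.random(n_bacteria) < p_migr
        pop[relocate] = rng.uniform(-5, 5, size=(relocate.sum(), dim))
    return best_x, best_f

sphere = lambda x: float(np.sum(x ** 2))            # simple unimodal benchmark
print(bfo_minimize(sphere))
```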
3.2 The flow of solving the optimal configuration with the bacterial foraging algorithm

Let X, Y and Z be the maximum execution counts of the taxis (chemotaxis), reproduction and migration operations respectively, and let x, y and z be the current counts of the three operations. U stands for the maximum swimming distance, and w stands for the number of swimming steps.
(1) The parameters of the algorithm were initialized, and x, y and z were set to 1.
(2) The taxis operation was performed, x ← x + 1. Each bacterium was cycled through: before reaching the maximum number of swimming steps, the bacterium tumbled and swam.
(3) After the taxis operation, the reproduction operation was performed if the reproduction count had not reached the maximum value Y, and the second step was repeated; the next step was taken once the count reached Y.
(4) The bacteria were then processed by the migration operation, and the dispersed bacteria were re-initialized at random positions.
(5) If the migration count had not reached the maximum Z after the migration operation, the procedure returned to the second step; otherwise the algorithm ended and the result was output.

The flow of the optimal configuration based on the bacterial foraging method is shown in Figure 1.

3.3 Modelling

Figure 2 shows the flow of the mathematical modelling.

3.3.1 Embodiment of abstract problems

Based on consideration of the actual situation, this study found that staff dispatch and allocation should satisfy three conditions, i.e., maximum customer satisfaction, a uniform workload of knowledge employees and minimum enterprise cost [9]. Therefore, the optimal configuration of knowledge employees was implemented based on these three conditions, detailed as follows.
(1) Customer satisfaction maximization refers to maximizing the satisfaction of customers with the services provided by the enterprise, which is beneficial to the long-term development of enterprise benefits. Customer satisfaction can be expressed by the time spent on tasks; maximum customer satisfaction thus means the shortest time spent on tasks.
(2) A uniform workload of knowledge employees refers to allocating the workload uniformly according to the working ability and conditions of the employees. To realize a uniform allocation of the workload, the differences between the employees' working times and the average working time should be minimal.
(3) Enterprise cost minimization refers to minimizing the resource cost in the process of production and services. The main purpose of an enterprise is profit, and minimizing the enterprise cost helps to gain profit efficiently.

Figure 1: The flow of optimal configuration based on the bacterial foraging algorithm.

3.3.2 Assumptions

To effectively study the resource optimization, assumptions about the work scheduling of knowledge employees were made first. The starting time point of the tasks was set to 0. Each employee can do only one task at a time, and each task can be fulfilled by only one employee. While a task is in progress, the other tasks wait. The completion time of each task is associated with the capacity of the employee, and the next task can start only after the previous one is completed. Only a broad setting is given here; in practical applications the assumptions need to be adapted to the actual conditions of the enterprise.

3.3.3 Setting variables

To simplify the modelling, the following variables were designed after the assumptions: (1) M: a set of m projects; (2) N: a set of n knowledge employees; (3) a sequenced set of the j tasks of project i; (4) an indicator of whether a knowledge employee is competent for a task, equal to 1 if the employee is competent and 0 otherwise; (5) the time for an employee to complete a task; (6) the starting time of a task; (7) the end time of a task; (8) the cost of an employee completing a task. For an individual employee, the set of tasks to be fulfilled was also defined, together with an indicator of whether a task can be allocated to the employee (0 if it cannot, 1 if it can), the time when the employee starts the r-th task and the time when the employee completes the r-th task.
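As an illustration of how the variables of Section 3.3.3 and the single-assignment constraint of Section 3.3.5 can be represented, the sketch below encodes a candidate dispatch as an assignment matrix and checks it against a competence matrix, then evaluates workload and cost. All matrices, values and names are made-up examples, not the paper's notation.

```python
import numpy as np

# Competence matrix: comp[k, t] = 1 if employee k can perform task t (cf. variable (4) above).
comp = np.array([[1, 0, 1, 1],
                 [1, 1, 0, 1],
                 [0, 1, 1, 0]])
# Time and cost for employee k to complete task t (cf. variables (5) and (8)); 0 where not competent.
time = np.array([[5, 0, 6, 4],
                 [6, 9, 0, 3],
                 [0, 7, 5, 0]], dtype=float)
cost = np.array([[0.8, 0.0, 1.8, 1.0],
                 [3.8, 4.8, 0.0, 0.9],
                 [0.0, 5.8, 1.2, 0.0]])

def is_feasible(assign: np.ndarray) -> bool:
    """assign[k, t] = 1 if task t is given to employee k.
    Feasible if every task is assigned to exactly one competent employee."""
    one_owner = np.all(assign.sum(axis=0) == 1)
    competent = np.all(assign <= comp)
    return bool(one_owner and competent)

def objectives(assign: np.ndarray):
    """Total completion time per employee (workload), a makespan proxy and the total cost."""
    workload = (assign * time).sum(axis=1)
    return workload, workload.max(), (assign * cost).sum()

candidate = np.array([[1, 0, 0, 0],
                      [0, 0, 0, 1],
                      [0, 1, 1, 0]])
print(is_feasible(candidate), objectives(candidate))
```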
3.3.4 Establishment of the mathematical model

Mathematical expressions were formulated for the three conditions to be satisfied in dispatching and allocation. Customer satisfaction is expressed as equation (1), the workload of knowledge employees as equation (2), and enterprise cost minimization as equation (3). The configuration management of the enterprise's knowledge employees is optimized using these three formulas, combined into the decision model of equation (4).

Figure 2: The flow of mathematical modelling.

Equation (4) was then normalized, weights were assigned to the three targets, and the objective function of the model was obtained as equation (5).

3.3.5 Establishment of the constraint conditions

A constraint condition is needed before every decision; when the decision-making objective was formulated, the following constraint conditions were also imposed. The constraint of equation (6) means that every task is independent and can only be completed by one knowledge employee. The starting time points, taking tasks and knowledge employees as subjects respectively, are given by equations (7) and (8), and the completion time points, again taking tasks and knowledge employees as subjects respectively, are given by equations (9) and (10). Equation (11) ensures that a task is allocated to a knowledge employee only when he or she is able to complete it, and equation (12) ensures that the set variables are non-negative.

4 Simulation experiment and analysis of the results

Task assignment and scheduling of employees are most widely applied in transportation systems [14]. In the field of transportation, such problems share the same characteristics: they are limited by time and space, and each task must take into account its start time and location and its end time and location. The most extensive application is the scheduling of airline crews, followed by the scheduling of drivers in the automotive industry. In the field of health care, it is necessary to take into account whether nurses have the appropriate qualifications and whether they are long-term or short-term in-service nurses, to ensure the fairness of the nurses' night and weekend shifts, and to handle their vacation and housing problems. To ensure an even distribution of the workload, the bacterial foraging method can be used to optimize the configuration management during task assignment and scheduling. It considers whether it is possible to insert a task into an existing time slot; if it is, the time can be shortened and the feasible solution can be improved. Finally, the scheduling can be managed with the least time and cost.

A project team was set up to handle the performance management related projects of an enterprise. The members of the project team should satisfy the following conditions: they should know the core knowledge of the enterprise's operation and development, have a strong bearing capacity to shoulder tasks of different difficulties allocated by the enterprise, and be responsible for management work. According to the characteristics of the team, the members of the team were subjected to optimal configuration management, and a simulation experiment was performed to verify the role of the decision model in the optimal configuration of resources.

4.1 Basic data

In the experiment, a problem of optimal configuration of knowledge employees was considered. The problem included two projects, each project contained six tasks, and each task was completed by one employee.
The experimental data are shown in Table 1.

Table 1: The time and cost (time/cost) spent on the tasks of projects M1 (P11-P16) and M2 (P21-P26). The symbol / means the task cannot be completed by the employee.
      P11     P12     P13     P14     P15      P16      P21      P22     P23     P24      P25      P26
N1    5/0.8   /       6/1.8   /       10/3     6/2.8    11/2.1   /       6/1.8   /        8/1.5    /
N2    6/3.8   9/4.8   4/0.8   /       4.1/1.8  11/5.8   10/3.3   5/1     9/3.2   /        10/2.6   7/2
N3    4/0.8   /       5/1.2   16/6    /        7/4.1    14/4.2   /       6/1.1   16/7.1   /        9/2.6
N4    /       8/3.8   /       13/5.5  5/1.6    /        /        8/2.5   /       18/6.9   13/3.3   /
N5    /       7/5.8   /       10/7    8/5.8    9/5.1    /        7/4.4   /       15/5.2   12/2.9   8/5.3
N6    6/6.8   8/4.3   7/7     12/5.2  /        /        14/8.1   8/3     10/3    16/5.6   /        10/3.2

4.2 Results analysis

The algorithm was simulated in MATLAB [10], with the parameters set according to the actual conditions. The maximum number of iterations was 200, the total number of bacteria was 50, there were 10 chemotaxis procedures, and the maximum number of swimming steps was 4. After the forward direction was determined, the forward step length of a bacterial individual was 0.7; there were three reproduction procedures and three migration procedures, the number of splitting cells was half of the total number of bacteria, and the probability of migration was 0.6. The simulation results are shown in Figure 3.

Figure 3: Simulation results.

Figure 3 demonstrates that the average value and the optimal value of the model for the optimal configuration of knowledge employees based on the bacterial foraging algorithm were basically the same, suggesting that the algorithm has favourable convergence. By the 10th generation, the value approached the optimal value and stayed at 1.42. By the 100th generation, the local optimal solution, i.e., 1.32, was obtained, indicating that the algorithm has a strong global searching ability.

4.3 Performance test of the bacterial foraging algorithm

To verify the effectiveness of the bacterial foraging algorithm in practice, the Sphere function, the Quartic function and the Rastrigin function were selected as the benchmark functions, as shown in Table 2. The effectiveness of the bacterial foraging algorithm and of the common particle swarm algorithm [15] on the three functions was then compared. To ensure the fairness of the results, the population size, the number of runs and the maximum number of iterations were set the same for both algorithms: the population size was set to 30, the maximum number of iterations to 1,000, and the number of runs to 50. The number of chemotaxis steps was set to 150, the number of replications to 8, and the number of migration/dispersal steps to 5. After this setting, every example was run independently 25 times. The results are shown in Table 3.

Table 2: Benchmark functions.
Benchmark function   Index range     Optimal solution   Peak value
Sphere               [-100, 100]     0                  Unimodal
Quartic              [-1.28, 1.28]   0                  Unimodal
Rastrigin            [-5.12, 5.12]   0                  Unimodal

Table 3: The experimental results of the examples.
Benchmark function   Result               Particle swarm algorithm   Bacterial foraging algorithm
Sphere               Average value        3.247±30.351               0.716±0.198
                     Optimal value        3.247                      0.416
                     Standard deviation   3.268e-4                   1.045e-5
Quartic              Average value        4.1e-2±0.051               1.621e-5±2.18e-5
                     Optimal value        4.137e-3                   3.529e-5
                     Standard deviation   0.902                      5.842e-5
Rastrigin            Average value        1.216e2±20.152             1.548e2±20.421
                     Optimal value        1.133e2                    1.213e-2
                     Standard deviation   0.028                      4.489e-2

When the number of iterations was the same, the optimal value obtained by the bacterial foraging algorithm was smaller than that obtained by the particle swarm algorithm.

5 Conclusion

Today, knowledge employees have become an indispensable part of knowledge-based enterprises, and how to optimize their configuration has been a research topic for many experts.
5 Conclusion

Today, knowledge employees have become an indispensable part of knowledge-based enterprises, and how to optimize their configuration has been a research topic for many experts. Wang and Zheng [11] performed optimal configuration of knowledge employees using particle swarm optimization and found that the algorithm was scientific and effective for the dispatch of knowledge employees. Liu et al. [12] optimized the configuration of knowledge employees in modern enterprises with a sliding mode control strategy and established the robustness of the knowledge employee system through an offline rectangle inequality.

This study realized the optimal configuration of knowledge employees in enterprises using the bacterial foraging algorithm, investigated the issue of employee dispatch, and established mathematical decision models. A simulation experiment was carried out to prove the effectiveness of the algorithm in optimizing the configuration of knowledge employees. The convergence performance of the bacterial foraging algorithm and the particle swarm algorithm was compared using three example functions, and the results demonstrated that the optimal value obtained by the bacterial foraging algorithm was smaller than that obtained by the particle swarm algorithm. This work can provide a reference for the dispatch of knowledge employees in the future.

6 Acknowledgement

This study was supported by the 2016 Soft Science Research Project of Shandong Province: Supply Chain Finance Ecological Environment Optimization Research in Shandong Province Under the Background of Supply Side Reform (Project Number: 2016RKA06016).

7 References

[1] Cao Y. (2013). The evaluation research on enterprise staff knowledge collaborative ability based on the collaborative process. International Conference on Information Management, Innovation Management and Industrial Engineering, IEEE, pp. 324-328.
[2] Liu B. (2011). Reasons and responses of loyalty fall to knowledge-type staff in nationalized coal enterprise. International Conference on Artificial Intelligence, Management Science and Electronic Commerce, IEEE, pp. 2272-2275.
[3] Jia J. F., Huang S. and Zhu Z. (2012). Internal Relationship of Action-Oriented Competency of Knowledge Workers. Journal of Northeastern University, Vol. 33, No. 4, pp. 601-604.
[4] Rout S. S., Misra B. B. and Samanta S. (2014). Load Allocation in Academic Environment: A Multi Objective PSO Approach. GSTF Journal on Computing, Vol. 3, No. 4, p. 9.
[5] Wang W. T. and Hou Y. P. (2015). Motivations of employees' knowledge sharing behaviors: A self-determination perspective. Information & Organization, Vol. 25, No. 1, pp. 1-26.
[6] Simeth M. and Lhuillery S. (2015). How do firms develop capabilities for scientific disclosure? Research Policy, Vol. 44, No. 7, pp. 1283-1295.
[7] Hota P. K., Barisal A. K. and Chakrabarti R. (2010). Economic emission load dispatch through fuzzy based bacterial foraging algorithm. International Journal of Electrical Power & Energy Systems, Vol. 32, No. 7, pp. 794-803.
[8] Sathya P. D. and Kayalvizhi R. (2011). Optimal multilevel thresholding using bacterial foraging algorithm. Expert Systems with Applications, Vol. 38, No. 12, pp. 15549-15564.
[9] Vargas A., Boza A., Patel S., Patel D. C., Cuenca L. and Bas A. O. (2016). Inter-enterprise architecture as a tool to empower decision-making in hierarchical collaborative production planning. Data & Knowledge Engineering, Vol. 105, pp. 5-22.
[10] Vedaldi A. and Lenc K. (2015). MatConvNet: Convolutional Neural Networks for MATLAB.
ACM International Conference on Multimedia, ACM, pp. 689-692.
[11] Wang Q. and Zheng H. C. (2011). Optimization of task allocation and knowledge workers scheduling based on particle swarm optimization. Information Technology & Standardization, pp. 574-578.
[12] Liu Y., Tang S., Shang C. and Qi H. (2013). Optimization of knowledge workers system by sliding mode control. Control Conference, IEEE, pp. 8309-8314.
[13] Bogdanowicz M. S. and Bailey E. K. (2013). The value of knowledge and the values of the new knowledge worker: generation X in the new economy. Journal of European Industrial Training, Vol. 26, No. 2/3/4, pp. 125-129.
[14] Demirović E., Musliu N. and Winter F. (2017). Modeling and solving staff scheduling with partial weighted maxSAT. Annals of Operations Research, No. 1, pp. 1-21.
[15] Guo Y. and Nan L. I. (2012). Multiple projects scheduling method based on cloud multi-objective particle swarm optimization. Computer Engineering & Applications, Vol. 48, No. 21, pp. 15-20.
[16] Chen Y. P., Li Y., Wang G., Zheng Y. F., Xu Q., Fan J. H. and Cui X. T. (2017). A novel bacterial foraging optimization algorithm for feature selection. Expert Systems with Applications, Vol. 83, No. C, pp. 1-17.

Application for Viral Hepatitis Infection Risk Assessment - HEPY

Alen Ajanović, Andrej Ulčar, Ana Peterlin, Karolina Počivavšek and Gašper Fele-Žorž
Faculty of Computer and Information Science, Večna pot 113, 1000 Ljubljana, Slovenia
E-mail: polz@fri.uni-lj.si

Anton Gradišek and Matjaž Gams
Institut Jožef Stefan, Jamova cesta 39, 1000 Ljubljana, Slovenia
E-mail: matjaz.gams@ijs.si

Mojca Matičič
Clinic for Infectious Diseases and Febrile Illnesses, University Medical Centre Ljubljana, Japljeva 2, 1525 Ljubljana, Slovenia
E-mail: mojca.maticic@kclj.si

Keywords: viral hepatitis, infection, health education, web application, preventive medicine

Received: June 2, 2018

We present a web application to inform users about different types of viral hepatitis. The core of the application is a questionnaire about past behavior and risk factors. Based on the answers, it produces a personalised overview of any risky actions that the user might have taken in the past. The site also contains general information about these diseases, which can help users identify them or take proper precautions in order to avoid them.

Povzetek: Predstavljamo spletno aplikacijo za informiranje uporabnikov o različnih tipih virusnega hepatitisa. Središče aplikacije je vprašalnik, ki na podlagi odgovorov in dejanj uporabnika poda osebno oceno stopnje nevarnosti za različne okužbe hepatitisa. Stran vsebuje tudi splošne informacije o teh boleznih, ki lahko uporabniku pomagajo pri iskanju hitre in ustrezne pomoči.

1 Introduction

Hepatitis is an inflammation of the liver. Hepatitis viruses are the most common cause of hepatitis. It presents an important global healthcare problem, as it has been estimated to affect 330 million people worldwide [1]. In Slovenia, it is estimated that less than 1% of the population is infected with HBV (hepatitis B virus) and around 0.4% with HCV (hepatitis C virus) [2]. They are of greatest concern because of the burden of illness and death they cause and the potential for outbreaks and epidemic spread. Viral hepatitis may be present as an acute or chronic disease. Acute disease may occur with mild or no symptoms, or may include symptoms such as jaundice, dark urine, pale stool, extreme fatigue, nausea, vomiting and abdominal pain. Acute HBV and HCV infections are the most likely to become ongoing and chronic, and they are the most common cause of liver cirrhosis and hepatocellular carcinoma, which can lead to liver failure and death [3]. Due to mild and nonspecific signs and symptoms, patients with chronic viral hepatitis typically do not know they are infected and are the main source of spreading the disease, mostly through risky behavior.

The main problem with hepatitis virus infections is that patients do not know they are infected until the disease has already developed to an advanced stage of liver failure, at which point it becomes difficult to treat. Since serious liver complications of chronic hepatitis infections can be prevented or managed if the disease is detected and treated in its early stages, it is important to identify the infected individuals as soon as possible and act accordingly.

Collecting and identifying valid health information online is difficult. Patients would be much better served by a centralized site which is endorsed by medical professionals and contains all the relevant information in one place.
In this paper, we present a web application (https://hepy.mf.uni-lj.si/) that aims to educate users about viral hepatitis infections and to assess possible risks of infection in a safe, anonymous environment. The application is a follow-up of ASPO, an application to educate about and assess risks for sexually transmitted infections [4], built on an improved platform with modified goals. We discuss the implementation and the functionality in the subsequent sections.

2 Related work

To date, the number of web applications dealing with assessment of the risk for viral hepatitis infection and informing the general population about different types of viral hepatitis is limited. A descriptive observational study of available viral hepatitis smartphone applications was carried out by Cantudo-Cuenca et al. in 2013 [5]. They identified 232 applications related to viral hepatitis in the Google Play Store (Android) and the Apple App Store (iOS), of which 33 were selected for further analysis. Most of these apps were uploaded under the medical category. Only 6 apps had exceeded 1000 downloads. A total of 12 apps were aimed at health professionals, while 4 focused on patients (and 7 on both). The participation of health professionals in the development of the apps was 57% [5]. The lack of professional healthcare involvement and the lack of public organisation participation in the development of such applications raise concern regarding the reliability and accuracy of their medical content [6].

3 Website description

3.1 General information

The main purpose of our web application is to provide static information about the different hepatitis diseases and general guidelines on how to identify the symptoms, as well as avenues for seeking help if needed. It provides a general overview of each individual disease as well as all the recommended steps needed to avoid contracting one. This makes it useful both for a potentially sick patient and for a healthy person, as it provides the necessary curative and preventive information.

Figure 1: Landing page of the website.

3.2 Questionnaire

The aim of the questionnaire is to analyse one's symptoms and provide personal feedback based on the given answers.
The questions are formed in a way that makes the user reflect on their actions and become acutely aware of past behavior which might have led to contracting a disease. This is reinforced by short messages which appear as the user is answering particular questions, informing them that such behavior is risky and why. At the end, users get a general overview of their answers and of how risky their actions were, using color-coding for severity: green, orange, or red.

4 System description

4.1 Frontend

The frontend uses Bootstrap [7] for styling and AngularJS [8] for dynamic content delivery. These technologies allow the page to seamlessly display information and track the user's progress even in the case of a temporary server or internet outage. Each answer is recorded only on the client while the questionnaire is being solved. When finished, the user is prompted and can decide whether to permanently delete the data or to send it anonymously to the server for statistical analysis. If the user does not agree, no data is stored or sent to the server. After completing the questionnaire, they get their final risk assessment, which consists of a personalised message based on their answers and risk factors. These may include not being vaccinated, travelling to foreign countries, etc.

4.2 Backend

Django [9] was used for the backend in order to allow a combination of a static website and a dynamic questionnaire, as well as to make potential future improvements easier. The questionnaire contains single- and multiple-choice questions. Since there is a large number of questions in total and some are mutually exclusive, answers to certain questions disable other questions. For example, as mandatory vaccination for HAV and HBV was introduced in 1993, people born after this year are not asked about their vaccination status. Similarly, people who did not travel to high-risk countries are not asked about their experiences during such trips. This approach allows us to set up an intricate web of questions while minimising the time it takes the user to answer them, while still giving full feedback without skimping on the important details. For easier access and maintenance, rules and other data regarding the questions, along with the questions themselves, are stored in a relational database and made accessible via the Django REST framework [10].
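As an illustration of how questions, answers, ordering and exclusion rules could be kept in a relational database and exposed through the Django REST framework, here is a minimal sketch. The model and field names (Question, Answer, order, risk_weight, excluded_questions) are assumptions made for illustration, not the actual HEPY schema.

```python
# models.py -- a minimal sketch of how questionnaire data could be stored.
# Model and field names are illustrative, not the actual HEPY schema.
from django.db import models

class Question(models.Model):
    TYPE_CHOICES = [("radio", "radio"), ("checkbox", "checkbox")]
    text = models.TextField()
    qtype = models.CharField(max_length=16, choices=TYPE_CHOICES, default="radio")
    order = models.PositiveIntegerField(db_index=True)  # position in the question queue

    class Meta:
        ordering = ["order"]

class Answer(models.Model):
    question = models.ForeignKey(Question, related_name="answers",
                                 on_delete=models.CASCADE)
    text = models.TextField()
    risk_weight = models.FloatField(default=0.0)  # contribution to the risk assessment
    # Questions that should be skipped if this answer is chosen.
    excluded_questions = models.ManyToManyField(Question, blank=True,
                                                related_name="excluded_by")
```

Such models can then be exposed read-only through standard Django REST framework ModelSerializer and ViewSet classes, so that the AngularJS frontend fetches the full rule set in one request.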
5 Methodology

5.1 Designing the database

At first, we had to design our system very generally, to allow the administrator to design the website as liberally as possible. This meant designing the structure in such a way that it would allow for possible expansion, as well as quick changes to content that is already available on the site. The goal was to create an interface where the administrator could log in with their credentials and change anything on the website at a moment's notice. Django already provides such a feature in the shape of an administrator page. Once set up, it allows a quick overview of the content on the website and running rudimentary queries directly through the page itself (such as inserting new instances of data, altering existing ones, etc.).

The first obstacle we faced was localisation: if we want the web page to be easily translatable and accessible via the admin page, all of the content has to be stored inside the database. This is not a problem in and of itself, but it does require some extra work to fetch the content, translate it, and put it back in. We had to sacrifice that accessibility in favour of the content being easily alterable, mostly because we do not expect translations to happen often, whereas it can be very important to update the content in case of new relevant medical information.

Another obstacle in designing the database was the functioning logic of the questionnaire itself. We could not afford to statically assign the questions, as they were continually altered, switched, or completely scrapped even while the site was being designed. We had to come up with a system that would allow us to assign different types of questions with different values regarding risk assessment, change their order, and even let some questions preclude others from being shown based on their answers. All of this had to be easily adjustable by the administrator.

5.2 Designing the questionnaire

In designing the aforementioned system there were a couple of options that we initially considered. We wanted to support different types of questions, so the easiest way to encode that was to have a number of discrete values and a type describing what the question was in the database. We could then programmatically extract the data and insert the appropriate question, based on its type, in the frontend part of the application. This means that introducing a completely new type would require writing code, but for the purposes of our questionnaire we mainly focused on discrete questions, since they allow us both to inform the user and to guide them through the process of answering. This also means that the administrator can switch between types with a single query, for example changing a "radio" type question to a "checkbox".

The second part was enabling a system that could exclude other questions based on the user's given answers. At first, we wanted questions to be separate entities that would be connected somehow, but we opted for a more flexible approach and utilised a queue. Every question has an assigned id and an order property. When the user begins answering questions, we construct a priority queue based on this order. Each question is comprised of entities called "answers". Every answer has a property which contains a (possibly empty) set of references (ids) to the questions which should not be shown if this answer is chosen. For example, if the user selects a certain answer and its exclusionID is set to 3, we remove the question with id=3 from the queue. As the user progresses through the questionnaire, this question is omitted. This also allows the user to go back and change their answer, making question 3 reappear unless some other answer disables it.

It is also possible to alter the order of the questions. The administrator can do it by hand by simply changing the priority, but it is also possible for a particular answer to trigger the change. We grouped questions into clusters that have similar semantic meanings. If an entity has a special flag that rearranges these, it is possible for an answer to alter the order in which these groups appear. This is simply done by having another property which tells us which group should come first.
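The order-and-exclusion behaviour described in Section 5.2 can be summarised in a short, self-contained sketch. It is written in plain Python rather than the application's AngularJS frontend, and the class and attribute names (Question, Questionnaire, exclusions) are illustrative assumptions rather than the actual implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Question:
    qid: int
    order: int
    text: str
    # answer text -> set of question ids excluded when that answer is chosen
    exclusions: dict = field(default_factory=dict)

class Questionnaire:
    """Minimal sketch of the queue, ordering and exclusion behaviour."""

    def __init__(self, questions):
        self.questions = {q.qid: q for q in questions}
        self.answers = {}          # qid -> chosen answer text

    def _excluded(self):
        """Union of the exclusion sets of all currently chosen answers."""
        out = set()
        for qid, ans in self.answers.items():
            out |= self.questions[qid].exclusions.get(ans, set())
        return out

    def next_question(self):
        """Lowest-order question that is unanswered and not excluded."""
        heap = [(q.order, q.qid) for q in self.questions.values()
                if q.qid not in self.answers and q.qid not in self._excluded()]
        heapq.heapify(heap)
        return self.questions[heap[0][1]] if heap else None

    def answer(self, qid, choice):
        # Changing an answer later automatically re-enables any question
        # that is no longer excluded by the new choice.
        self.answers[qid] = choice

q1 = Question(1, order=10, text="Were you born before 1993?",
              exclusions={"no": {3}})          # born after 1993: skip question 3
q3 = Question(3, order=20, text="Have you been vaccinated against HBV?")
quiz = Questionnaire([q1, q3])
quiz.answer(1, "no")
print(quiz.next_question())                    # -> None, question 3 is excluded
```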
Conclusion

As the number of hepatitis infections increases, we need to find a way to inform the general public and provide its members with medically accurate information in order to combat the disease before it develops too far. While the internet holds a lot of information, not all of it is completely medically accurate. The aim of the application is to readily provide all relevant information to users, both for identifying potential risk factors or symptoms and for learning about preventative measures, while at the same time offering a safe, anonymous online environment. With the introduction of this website, we hope to bridge the knowledge gap and to urge users to seek appropriate help and spread this valuable information to others.

Acknowledgement

The project was carried out under the public tender "Po kreativni poti do praktičnega znanja" (Creative path to practical knowledge), funded by the Ministry of Education, Science and Sport, and the European Union. We thank the other participants in the project: Matic Podpadec, Ana Prodan, Saša Rink, and Špela Pišek.

References

[1] WHO, http://www.who.int/mediacentre/factsheets/. Accessed: 22-05-2018.
[2] J. Tomažič and F. Strle, Infekcijske bolezni, vol. 1. Ljubljana: Združenje za infektologijo, Slovensko zdravniško društvo, 2014/2015; A. Wasley, S. Grytdal, and K. Gallagher, "Surveillance for Acute Viral Hepatitis - United States, 2006," MMWR Surveillance Summaries, 2008. [Online]. Available: https://www.cdc.gov/mmwr/preview/mmwrhtml/ss5702a1.htm. Accessed: 23-05-2018.
[3] N. A. Terrault, N. H. Bzowej, K.-M. Chang, J. P. Hwang, M. M. Jonas, and M. H. Murad, "AASLD guidelines for treatment of chronic hepatitis B," Hepatology, vol. 63, no. 1, pp. 261-283, Jan. 2016.
[4] A. Ajanović, J. Konda, G. Fele-Žorž, A. Gradišek, M. Gams, A. Peterlin, ... & M. Matičič, "Application for sexually transmitted infection risk assessment," Informatica, 41(2), June 2017.
[5] M. R. Cantudo-Cuenca, M. A. Robustillo-Cortés, M. D. Cantudo-Cuenca, R. Morillo-Verdugo, "A better regulation is required in viral hepatitis smartphone applications," Farm Hosp, 38(2):112-7, April 2014.
[6] S. D. Burdette, T. E. Herchline, R. Oehler, "Surfing the web practicing medicine in a technological age: using smartphones in clinical practice," Clin Infect Dis, vol. 47, pp. 117-122, Jul. 2008.
[7] Bootstrap CSS, https://getbootstrap.com/css/. Accessed: 19-05-2018.
[8] AngularJS, https://angularjs.org/. Accessed: 19-05-2018.
[9] Django, https://www.djangoproject.com/. Accessed: 19-05-2018.
[10] Django REST framework, http://www.django-rest-framework.org/. Accessed: 19-05-2018.

JOŽEF STEFAN INSTITUTE

Jožef Stefan (1835-1893) was one of the most prominent physicists of the 19th century. Born to Slovene parents, he obtained his Ph.D. at Vienna University, where he was later Director of the Physics Institute, Vice-President of the Vienna Academy of Sciences and a member of several scientific institutions in Europe. Stefan explored many areas in hydrodynamics, optics, acoustics, electricity, magnetism and the kinetic theory of gases. Among other things, he originated the law that the total radiation from a black body is proportional to the 4th power of its absolute temperature, known as the Stefan-Boltzmann law.

The Jožef Stefan Institute (JSI) is the leading independent scientific research institution in Slovenia, covering a broad spectrum of fundamental and applied research in the fields of physics, chemistry and biochemistry, electronics and information science, nuclear science technology, energy research and environmental science.

The Jožef Stefan Institute (JSI) is a research organisation for pure and applied research in the natural sciences and technology. Both are closely interconnected in research departments composed of different task teams.
Emphasis in basic research is given to the development and education of young scientists, while applied research and development serve for the transfer of advanced knowledge, contributing to the development of the national economy and society in general. At present the Institute, with a total of about 900 staff, has 700 researchers, about 250 of whom are postgraduates, around 500 of whom have doctorates (Ph.D.), and around 200 of whom have permanent professorships or temporary teaching assignments at the Universities. In view of its activities and status, the JSI plays the role of a national institute, complementing the role of the universities and bridging the gap between basic science and applications.

Research at the JSI includes the following major fields: physics; chemistry; electronics, informatics and computer sciences; biochemistry; ecology; reactor technology; applied mathematics. Most of the activities are more or less closely connected to information sciences, in particular computer sciences, artificial intelligence, language and speech technologies, computer-aided design, computer architectures, biocybernetics and robotics, computer automation and control, professional electronics, digital communications and networks, and applied mathematics.

The Institute is located in Ljubljana, the capital of the independent state of Slovenia (or S♥nia). The capital today is considered a crossroad between East, West and Mediterranean Europe, offering excellent productive capabilities and solid business opportunities, with strong international connections. Ljubljana is connected to important centers such as Prague, Budapest, Vienna, Zagreb, Milan, Rome, Monaco, Nice, Bern and Munich, all within a radius of 600 km.

From the Jožef Stefan Institute, the Technology park "Ljubljana" has been proposed as part of the national strategy for technological development to foster synergies between research and industry, to promote joint ventures between university bodies, research institutes and innovative industry, to act as an incubator for high-tech initiatives and to accelerate the development cycle of innovative products. Part of the Institute was reorganized into several high-tech units supported by and connected within the Technology park at the Jožef Stefan Institute, established as the beginning of a regional Technology park "Ljubljana". The project was developed at a particularly historical moment, characterized by the process of state reorganisation, privatisation and private initiative. The national Technology Park is a shareholding company hosting an independent venture-capital institution. The promoters and operational entities of the project are the Republic of Slovenia, Ministry of Higher Education, Science and Technology and the Jožef Stefan Institute. The framework of the operation also includes the University of Ljubljana, the National Institute of Chemistry, the Institute for Electronics and Vacuum Technology and the Institute for Materials and Construction Research among others. In addition, the project is supported by the Ministry of the Economy, the National Chamber of Economy and the City of Ljubljana.

Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel.: +38614773 900, Fax: +38612519385
WWW: http://www.ijs.si
E-mail: matjaz.gams@ijs.si
Public relations: Polona Strnad

INFORMATICA
AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS
INVITATION, COOPERATION

Submissions and Refereeing
Please register as an author and submit a manuscript at: http://www.informatica.si.
At least two referees outside the author's country will examine it, and they are invited to make as many remarks as possible, from typing errors to global philosophical disagreements. The chosen editor will send the author the obtained reviews. If the paper is accepted, the editor will also send an email to the managing editor. The executive board will inform the author that the paper has been accepted, and the author will send the paper to the managing editor. The paper will be published within one year of receipt of email with the text in Informatica MS Word format or Informatica LaTeX format and figures in .eps format. Style and examples of papers can be obtained from http://www.informatica.si. Opinions, news, calls for conferences, calls for papers, etc. should be sent directly to the managing editor.

SUBSCRIPTION
Please, complete the order form and send it to Dr. Drago Torkar, Informatica, Institut Jožef Stefan, Jamova 39, 1000 Ljubljana, Slovenia. E-mail: drago.torkar@ijs.si

Since 1977, Informatica has been a major Slovenian scientific journal of computing and informatics, including telecommunications, automation and other related areas. In its 16th year (more than twenty-four years ago) it became truly international, although it still remains connected to Central Europe. The basic aim of Informatica is to impose intellectual values (science, engineering) in a distributed organisation.

Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations.

Editing and refereeing are distributed. Each editor can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the Refereeing Board.

Informatica web edition is free of charge and accessible at http://www.informatica.si. Informatica print edition is free of charge for major scientific, educational and governmental institutions. Others should subscribe.

Informatica WWW: http://www.informatica.si/

Referees from 2008 on: A. Abraham, S. Abraham, R. Accornero, A. Adhikari, R. Ahmad, G. Alvarez, N. Anciaux, R. Arora, I. Awan, J. Azimi, C. Badica, Z. Balogh, S. Banerjee, G. Barbier, A. Baruzzo, B. Batagelj, T. Beaubouef, N. Beaulieu, M. ter Beek, P. Bellavista, K. Bilal, S. Bishop, J. Bodlaj, M. Bohanec, D. Bolme, Z. Bonikowski, B. Bošković, M. Botta, P. Brazdil, J. Brest, J. Brichau, A. Brodnik, D. Brown, I. Bruha, M. Bruynooghe, W. Buntine, D.D. Burdescu, J. Buys, X. Cai, Y. Cai, J.C. Cano, T. Cao, J.-V. Capella-Hernández, N. Carver, M. Cavazza, R. Ceylan, A. Chebotko, I. Chekalov, J. Chen, L.-M. Cheng, G. Chiola, Y.-C. Chiou, I. Chorbev, S.R. Choudhary, S.S.M. Chow, K.R. Chowdhury, V. Christlein, W. Chu, L. Chung, M. Ciglarič, J.-N. Colin, V. Cortellessa, J. Cui, P. Cui, Z. Cui, D. Cutting, A. Cuzzocrea, V. Cvjetkovic, J. Cypryjanski, L. ˇCerepnalkoski, I. ˇc, G.
Daniele,G. Cehovin,D. ˇCosi´Danoy,M. Dash,S. Datt,A. Datta, M.-Y.Day,F. Debili, C.J. Debono,J. Dediˇc,P.Degano,A. Dekdouk,H. Demirel,B. Demoen,S. Dendamrongvit,T. Deng,A. Derezinska,J. Dezert,G. Dias,I. Dimitrovski,S. Dobrišek, Q. Dou, J. Doumen, E. Dovgan, B. Dragovich, D. Drajic, O. Drbohlav, M. Drole, J. Dujmovi´c, O. Ebers, J. Eder, S. Elaluf-Calderwood,E. Engstr,U. riza Erturk,A.Farago,C. Fei,L. Feng,Y.X. Feng,B. Filipiˇc,I. Fister,I. FisterJr.,D. Fišer,A. Flores,V.A.Fomichov,S.Forli,A. Freitas,J. Fridrich,S. Friedman,C.Fu,X.Fu,T. Fujimoto, G. Fung, S. Gabrielli, D. Galindo, A. Gambarara, M. Gams, M. Ganzha, J. Garbajosa, R. Gennari, G. Georgeson, N. Gligori´c, S. Goel, G.H. Gonnet, D.S. Goodsell, S. Gordillo, J. Gore, M. Grˇcar, M. Grgurovi´c, D. Grosse, Z.-H. Guan, D. Gubiani, M. Guid, C. Guo, B. Gupta, M. Gusev, M. Hahsler, Z. Haiping, A. Hameed, C. Hamzaçebi, Q.-L. Han,H. Hanping,T. Härder, J.N. Hatzopoulos,S. Hazelhurst,K. Hempstalk, J.M.G. Hidalgo,J. Hodgson, M. Holbl, M.P. Hong, G. Howells, M. Hu, J. Hyvärinen, D. Ienco, B. Ionescu, R. Irfan, N. Jaisankar, D. Jakobovic,K. Jassem,I.Jawhar,Y. Jia,T. Jin,I. Jureta,.. Juri´ciˇ´ c,S.K,S. Kalajdziski,Y. Kalantidis,B. Kaluža, D. Kanellopoulos,R. Kapoor,D. Karapetyan,A. Kassler, D.S. Katz,A.Kaveh, S.U. Khan,M. Khattak,V. Khomenko, E.S. Khorasani,I. Kitanovski,D.Kocev,J.Kocijan,J.Kollár,A.Kontostathis,P.Korošec,A. Koschmider, D.Košir, J.Kovaˇ c,A. Krajnc,M. Krevs,J. Krogstie,P. Krsek,M.Kubat,M.Kukar,A.Kulis, A.P.S. Kumar, H. Kwa´ snicka,W.K. Lai, C.-S. Laih, K.-Y. Lam,N. Landwehr,J. Lanir,A.Lavrov,M. Layouni,G. Leban, A.Lee,Y.-C.Lee,U.Legat,A. Leonardis,G.Li,G.-Z.Li,J.Li,X.Li,X.Li,Y.Li,Y.Li,S.Lian,L.Liao,C.Lim, J.-C.Lin,H.Liu,J.Liu,P.Liu,X.Liu,X.Liu,F.Logist,S.Loskovska,H.Lu,Z.Lu,X.Luo,M. Luštrek,I.V. Lyustig, S.A. Madani,M. Mahoney, S.U.R. Malik,Y. Marinakis,D. Marinciˇˇ c, J. Marques-Silva, A. Martin, D. Marwede, M. Matijaševi´ c,T. Matsui,L. McMillan,A. McPherson,A. McPherson,Z. Meng, M.C. Mihaescu,V. Milea,N. Min-Allah,E. Minisci,V. Miši´ c, A.-H. Mogos,P. Mohapatra, D.D. Monica,A. Montanari,A. Moroni,J. Mosegaard,M. Moškon,L.deM. Mourelle,H. Moustafa,M. Možina,M. Mrak,Y.Mu,J. Mula,D.Nagamalai, M.Di Natale,A.Navarra,P.Navrat,N. Nedjah,R. Nejabati,W.Ng,Z.Ni, E.S. Nielsen,O. Nouali,F.Novak,B. Novikov,P. Nurmi,D. Obrul,B. Oliboni,X.Pan,M.Panˇc, B.-K. cur,W.Pang, G.Papa, M.Paprzycki, M.ParaliˇPark,P.Patel,T.B. Pedersen,Z. Peng, R.G. Pensa,J. Perš,D. Petcu,B. Petelin,M. Petkovšek,D.Pevec,M. Piˇcan,M. Polo,V. Pomponiu,E. Popescu,D. Poshyvanyk,B. Potoˇ culin,R. Piltaver,E. Pirogova,V. Podpeˇcnik, R.J.Povinelli, S.R.M. Prasanna,K. Pripuži´c,G. Puppis,H. Qian,Y. Qian,L. Qiao,C. Qin,J. Que, J.-J. Quisquater,C. Rafe,S. Rahimi,V.Rajkovi ˇc,J. Ramaekers,J. Ramon,R.Ravnik,Y. Reddy,W. c, D. Rakovi´Reimche, H. Rezankova, D. Rispoli, B. Ristevski, B. Robiˇ c, J.A. Rodriguez-Aguilar,P. Rohatgi,W. Rossak,I. Rožanc,J. Rupnik, S.B. Sadkhan,K. Saeed,M. Saeki, K.S.M. Sahari,C. Sakharwade,E. Sakkopoulos,P. Sala, M.H. Samadzadeh, J.S. Sandhu,P. Scaglioso,V. Schau,W. Schempp,J. Seberry,A. Senanayake,M. Senobari, T.C. Seong,S. Shamala, c. shi,Z. Shi,L. Shiguo,N. Shilov, Z.-E.H. Slimane,F. Smith,H. Sneed,P. Sokolowski, T. Song, A. Soppera, A. Sorniotti, M. Stajdohar, L. Stanescu, D. Strnad, X. Sun, L. Šajn, R. Šenkeˇrík, M.R. Šikonja,J. Šilc,I. Škrjanc,T. Štajner,B. Šter,V. Štruc,H.Takizawa,C.Talcott,N.Tomasev,D.Torkar,S. Torrente,M.Trampuš,C.Tranoris,K.Trojacanec,M. Tschierschke,F.DeTurck,J.Twycross,N. Tziritas,W. Vanhoof,P.Vateekul, L.A.Vese,A.Visconti,B. 
Vlaovič, V. Vojisavljević, M. Vozalis, P. Vračar, V. Vranić, C.-H. Wang, H. Wang, H. Wang, H. Wang, S. Wang, X.-F. Wang, X. Wang, Y. Wang, A. Wasilewska, S. Wenzel, V. Wickramasinghe, J. Wong, S. Wrobel, K. Wrona, B. Wu, L. Xiang, Y. Xiang, D. Xiao, F. Xie, L. Xie, Z. Xing, H. Yang, X. Yang, N.Y. Yen, C. Yong-Sheng, J.J. You, G. Yu, X. Zabulis, A. Zainal, A. Zamuda, M. Zand, Z. Zhang, Z. Zhao, D. Zheng, J. Zheng, X. Zheng, Z.-H. Zhou, F. Zhuang, A. Zimmermann, M.J. Zuo, B. Zupan, M. Zuqiang, B. Žalik, J. Žižka

Informatica
An International Journal of Computing and Informatics
Web edition of Informatica may be accessed at: http://www.informatica.si.

Subscription Information
Informatica (ISSN 0350-5596) is published four times a year in Spring, Summer, Autumn and Winter (4 issues per year) by the Slovene Society Informatika, Litostrojska cesta 54, 1000 Ljubljana, Slovenia.
The subscription rate for 2018 (Volume 42) is
– 60 EUR for institutions,
– 30 EUR for individuals, and
– 15 EUR for students.
Claims for missing issues will be honored free of charge within six months after the publication date of the issue.

Typesetting: Borut Žnidar. Printing: ABO grafika d.o.o., Ob železnici 16, 1000 Ljubljana.
Orders may be placed by email (drago.torkar@ijs.si), telephone (+38614773900) or fax (+38612519385). The payment should be made to our bank account no.: 02083-0013014662 at NLB d.d., 1520 Ljubljana, Trg republike 2, Slovenija, IBAN no.: SI56020830013014662, SWIFT Code: LJBASI2X.

Informatica is published by Slovene Society Informatika (president Niko Schlamberger) in cooperation with the following societies (and contact persons):
Slovene Society for Pattern Recognition (Simon Dobrišek)
Slovenian Artificial Intelligence Society (Mitja Luštrek)
Cognitive Science Society (Olga Markič)
Slovenian Society of Mathematicians, Physicists and Astronomers (Marej Brešar)
Automatic Control Society of Slovenia (Nenad Muškinja)
Slovenian Association of Technical and Natural Sciences / Engineering Academy of Slovenia (Stane Pejovnik)
ACM Slovenia (Matjaž Gams)

Informatica is financially supported by the Slovenian research agency from the Call for co-financing of scientific periodical publications.
Informatica is surveyed by: ACM Digital Library, Citeseer, COBISS, Compendex, Computer & Information Systems Abstracts, Computer Database, Computer Science Index, Current Mathematical Publications, DBLP Computer Science Bibliography, Directory of Open Access Journals, InfoTrac OneFile, Inspec, Linguistic and Language Behaviour Abstracts, Mathematical Reviews, MatSciNet, MatSci on SilverPlatter, Scopus, Zentralblatt Math

Volume 42, Number 2, June 2018, ISSN 0350-5596

Counterexamples in Model Checking – A Survey | H. Debbi | 145
Evaluation of Medical Image Algorithms on Multicore Processors | D. Demirović, Z. Šabanović | 167
A Modification of the Lasso Method by Using the Bahadur Representation for the Genome-wide Association Study | L.V. Utkin, Y.A. Zhuk | 175
PSO with Crossover Operator Applied to Feature Selection Problem in Classification | H. Hichem, M. Rafik, M.T. Mesaaoud | 189
A Multi-Agent based Approach for Simulating the Impact of Human Behaviours on Air Pollution | S. Ghazi, J. Dugdale, T. Khadir | 199
Fast Artificial Bee Colony for Clustering | A.S. Girsang, Y. Muliono, F. Sa | 211
A Pairing Free Secure Identity-based Aggregate Signature Scheme Under Random Oracle | E. Abouelkheir, J.G. Tromp | 221
Static and Incremental Overlapping Clustering Algorithms for Large Collections Processing in GPU | L.J. González-Soler, A. Pérez-Suárez, L. Chang | 229
Weighted Density Center Clustering Protocol for Wireless Sensor Networks | A. Slimani, M. Redjimi, D. Slimani | 245
Persistent Homology and Machine Learning | P. Škraba | 253
Research on Intelligent English Oral Training System in Mobile Network | F. Zhu | 259
Probability Matrix Decomposition Based Collaborative Filtering Recommendation Algorithm | Y. Tan, H. Zhao, Y. Wang, M. Qiu | 265
The Decision Model for the Optimal Configuration Management of Knowledge Employees in Enterprises | C. Yang, Y. Yang, X. Tan | 273
Application for Viral Hepatitis Infection Risk Assessment - HEPY | A. Ajanović, A. Ulčar, A. Peterlin, K. Počivavšek, G. Fele-Žorž, A. Gradišek, M. Gams, M. Matičič | 279

Informatica 42 (2018) Number 2, pp. 145–283