Special Issue: Information and Communication Technology
Guest Editors: Huynh Thi Thanh Binh, Ichiro Ide

Editorial Boards
Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations.
Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Editorial Board or referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors.
The necessary coordination is carried out by the Executive Editors, who examine the reviews, sort the accepted articles and maintain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is partially supported by the Slovenian Ministry of Higher Education, Science and Technology.
Each author is guaranteed to receive the reviews of his article. When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article.

Executive Editor – Editor in Chief
Matjaž Gams
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
matjaz.gams@ijs.si
http://dis.ijs.si/mezi/matjaz.html

Editor Emeritus
Anton P. Železnikar
Volaričeva 8, Ljubljana, Slovenia
s51em@lea.hamradio.si
http://lea.hamradio.si/~s51em/

Executive Associate Editor – Deputy Managing Editor
Mitja Luštrek, Jožef Stefan Institute
mitja.lustrek@ijs.si

Executive Associate Editor – Technical Editor
Drago Torkar, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
drago.torkar@ijs.si

Contact Associate Editors
Europe, Africa: Matjaž Gams
N. and S. America: Shahram Rahimi
Asia, Australia: Ling Feng
Overview papers: Maria Ganzha, Wiesław Pawłowski, Aleksander Denisiuk

Editorial Board
Juan Carlos Augusto (Argentina), Vladimir Batagelj (Slovenia), Francesco Bergadano (Italy), Marco Botta (Italy), Pavel Brazdil (Portugal), Andrej Brodnik (Slovenia), Ivan Bruha (Canada), Wray Buntine (Finland), Zhihua Cui (China), Aleksander Denisiuk (Poland), Hubert L. Dreyfus (USA), Jozo Dujmović (USA), Johann Eder (Austria), George Eleftherakis (Greece), Ling Feng (China), Vladimir A. Fomichov (Russia), Maria Ganzha (Poland), Sumit Goyal (India), Marjan Gušev (Macedonia), N.
Jaisankar (India), Dariusz Jacek Jakóbczak (Poland), Dimitris Kanellopoulos (Greece), Samee Ullah Khan (USA), Hiroaki Kitano (Japan), Igor Kononenko (Slovenia), Miroslav Kubat (USA), Ante Lauc (Croatia), Jadran Lenarčič (Slovenia), Shiguo Lian (China), Suzana Loskovska (Macedonia), Ramon L. de Mantaras (Spain), Natividad Martínez Madrid (Germany), Sanda Martinčić-Ipšić (Croatia), Angelo Montanari (Italy), Pavol Návrat (Slovakia), Jerzy R. Nawrocki (Poland), Nadia Nedjah (Brazil), Franc Novak (Slovenia), Marcin Paprzycki (USA/Poland), Wiesław Pawłowski (Poland), Ivana Podnar Žarko (Croatia), Karl H. Pribram (USA), Luc De Raedt (Belgium), Shahram Rahimi (USA), Dejan Raković (Serbia), Jean Ramaekers (Belgium), Wilhelm Rossak (Germany), Ivan Rozman (Slovenia), Sugata Sanyal (India), Walter Schempp (Germany), Johannes Schwinn (Germany), Zhongzhi Shi (China), Oliviero Stock (Italy), Robert Trappl (Austria), Terry Winograd (USA), Stefan Wrobel (Germany), Konrad Wrona (France), Xindong Wu (USA), Yudong Zhang (China), Rushan Ziatdinov (Russia & Turkey)

IJCAI 2018 – Chinese Dominance Established
Matjaž Gams
Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
E-mail: matjaz.gams@ijs.si

Editorial
In July 2018 in Stockholm, ICML, AAMAS, ICCBR and SoCS joined with IJCAI and ECAI to establish the first major worldwide AI event. This paper is about the resulting IJCAI-ECAI event [1]. The 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence merged with the other events to form a single conference. Around 7000 participants divided their time between these conferences over 14 days, as one fee covered the entrance to all the events. As a consequence, many researchers attended several conferences, which in itself was a major achievement. Namely, the conferences and even the individual sections of the conferences are becoming so specialized that AI researchers are becoming oblivious to the achievements being made in related areas, leading to narrow specialization and small incremental improvements, thereby deterring major innovations. Fortunately, in 2018 there was a serious attempt to reintegrate the field. For the organizers, the super-event joint conferences represented a huge effort, but everything ran smoothly – albeit with a couple of small exceptions, as usual. One of them was the initial robot dance, where a Nao robot performed a predefined sequence of moves, which the human dancer enriched with dynamics and scope. The glitch was a loss of sound during the event (deliberate or by accident?). Added to this, the lack of any AI in the performance was a huge issue for many of the participants; in particular, the absence of true AI, one of the central themes of the conference. However, the artistic impression was there. Perhaps not surprisingly, as the small Nao robot was clearly physically and dynamically very much inferior to the flexible human dancer, a kind of reverse of David and Goliath seemed to be taking place. Also, the big 1000+ seat lecture rooms were organized in such a way that at no time was everybody seated; there always seemed to be 5–10 people in motion. That aside, Stockholm is a traditional, open, metropolitan city that has hosted conferences with up to several tens of thousands of participants before, and the AI organizers have extensive experience as well; so by any measure the event must be considered an organizational success.
The IJCAI-ECAI joint event involved a record 3470 submitted papers: 37% more than in 2017, while the 2017 event was 11% up on the previous year, confirming the steady growth from 2007 on. AI continues to progress as a scientific field and as an area of human interest. The first major technical impression in 2018 was that Chinese dominance has finally been established. Eclatantly! In 2017, 37% of the papers were Chinese, while a year later this figure was 46%. Only a 9-percentage-point increase, one might say, but the 2017 conference was in Australia, with strong Chinese ties, while Stockholm is in Europe, and it was a joint European and international IJCAI conference, meaning around half of the event was basically a European conference. Despite that, European and American papers constituted around 20% each, while several authors, in particular from the USA, were also Chinese. Astonishment and admiration are the right words to describe this Chinese success. The more detailed numbers are as follows: of the 710 accepted papers (21% acceptance rate), 325 came from China, 129 from the EU (UK 37, France 22, Italy 18, Germany 15, Austria 12), 122 from the USA, 26 from Singapore, 23 from Australia, 17 from Japan, 13 from Israel, etc. When asked whether it would be reasonable to limit non-European papers, at least for the ECAI conference, say to 50%, several researchers expressed concern that several of the best non-European papers would then end up being presented at other conferences. Several of the Chinese papers were indeed of high quality, demonstrating Chinese innovation, good education and the major support for AI in China. There were some concerns that the Chinese papers often follow a pattern: a specific idea, lots of complicated mathematics and an unverified empirical comparison. But that is true for many other papers as well. It should be noted, in addition, that due to several national European research policies, it is often nearly worthless for domestic evaluations to publish a paper at IJCAI or ECAI, since all that counts for these researchers are journal publications. The absence of more high-quality European papers might therefore be partially attributed to the strange European scientific policies. Some of these issues were discussed at the panels, as presented in Figures 1 and 2.
Figure 1: At a panel about European AI, the importance of the field for European progress was clearly established.
In some presentations, e.g., the one shown in Figure 1, the European position and self-evaluation were not exactly in accordance with the percentage of conference papers. Some other positions even sounded a bit like a clip from a galaxy far, far away. But in reality, the panels were of high quality and several essential issues and initiatives were raised. Several panelists mentioned that there is no AI coordination in Europe, even though the EU is still the no. 1 world economy. In terms of AI funding, the USA prevails over China 2:1, and China prevails over the EU, again by 2:1. Such estimates might be misleading, since they are nominal comparisons – in real-economy terms (how many kilograms of sugar or of steel the funding buys), China is already above the USA in scientific funding. There are two important differences between the USA and the EU: the USA executes bold international policies, whereas the EU's soft approach is sometimes hurting its economy and society. The EU used to be no. 1 in computer science; now it is no. 3.
Much of this falling behind was not necessary at all; instead there are subjective leadership reasons for the decline, e.g., the EU patent system is enormously complicated and bureaucratic compared to the American one. Another problem: the UK has the best European AI based on many criteria, and so Brexit will make the situation worse for the EU. While top EU projects like H2020 represent world-class research, and the EU is still leading in many areas of business and science, strong scientific funding for key areas, as well as policies to support them, is lacking.
Figure 2: The EU strategy involves three elements: science/technology, socio-economic changes and the social framework.
While the EU is as concerned with legal issues as it is with the research, China has significantly improved its AI efforts through governmental and private funding, and there is no major rift in its government. The democracies of several European countries and the USA are torn apart by ideological and political antagonisms, instead of focusing on technical progress. For Chinese researchers, the road to success and to obtaining a good position at home is to publish at major AI conferences and in major journals, and to join established research teams in the USA or Europe. For Europeans, it is possible to follow the Chinese path, but no European country offers a several-times-higher salary for researchers returning home, like China does. While the presidents of superpowers from the USA to Russia declare the tremendous importance of AI in relation to world dominance, the percentage of papers best demonstrates who supports the field the most. This is not to say that the major countries are not increasing their AI funds significantly. For example, the EU presented its plans at IJCAI (Figure 3): first a 70% increase, followed by a 100% and then another 100% increase. The US Department of Defense (DoD) established the Joint AI Center (JAIC). It will host the DoD's 600 AI projects with an estimated $1.7 billion over 6 years. As predicted, AI will likely change the nature of warfare, along with several other fields. However, without sufficient AI research, nobody can expect to maintain a leading position in the world.
Figure 3: The EU will significantly increase AI funding. Finally. Will national governments follow?
A closer look at the EU plan reveals several new ideas, as presented in Figure 4. Among others, the EU will fund an open AI platform, at least partially influenced by Elon Musk's OpenAI, which, by the way, won the first 5 vs 5 Dota 2 game against expert players (with some small additional limitations). The EU plan was probably the major AI strategy presented at the conference. While China does it on its own and the USA allocates most funds to military applications, the EU is focused on a public, general, AI-boosting plan to benefit everybody. That is for sure great news, not only for AI in Europe, but for humanity as a whole.
Figure 4: The EU strategy introduces several integrating EU components, including the AI toolbox and the Network of Digital Innovation Hubs. Unfortunately, many of the most advanced AI hubs are in the UK.
Several new mechanisms like CLAIRE are already active (https://claire-ai.org/): "an initiative by the European AI community that seeks to strengthen European excellence in AI research and innovation." "If Europe were to fall behind in AI technology, we would be likely to face challenging economic consequences, an academic brain drain, reduced transparency, and increasing dependency on foreign technologies, products and values. The CLAIRE initiative presents a proposal to avoid that." "The CLAIRE initiative aims to establish a pan-European network of Centres of Excellence in AI, strategically located throughout Europe, and a new, central facility with a state-of-the-art, "Google-scale", CERN-like infrastructure – the CLAIRE Hub – that will promote new and existing talent and provide a focal point for the exchange and interaction of researchers at all stages of their careers, across all areas of AI. The CLAIRE Hub will not be an elitist AI institute with a permanent scientific staff, but an environment where Europe's brightest minds in AI meet and work for limited periods of time. This will increase the flow of knowledge among European researchers and back to their home institutions."
Maybe we should also remember the times when science was not a business, when we researched not for the purpose of cash, but for reasons of fundamental curiosity, a desire to improve our knowledge. Some spirit of that kind is still observable at the conferences and was also demonstrated, for example, by the computer chess tournament. During the breaks many participants stopped by and observed the most interesting matches. Komodo won the World Computer Chess Championship 2018 after a play-off with GridGinkgo. In third place was Jonny, due to a win over Leela Chess Zero. The latter was observed with much interest, as it implements the AlphaZero approach on a PC. It was no match for the best programs; instead it played very differently – intuitively, lucidly and error-prone. Obviously, it lacked the power of the Google computers to validate its fancy ideas, often in the form of sacrifices. Figure 6 shows the Komodo team, who received the Shannon Trophy (and replica) from the chairman of the ICGA, David Levy.
Figure 6: Komodo was again the computer chess champion on PCs. Leela Chess Zero, a PC version of AlphaZero, played lucidly, but had no chance against the hard logic of Komodo.
In Stockholm 2018, the social meetings of societies boosted the exchange of information and cooperation, be it inside the EU or in international societies – for example, the EurAI meeting (Figure 7), the IJCAI AI societies meeting, IFIP meetings, etc. The IJCAI report should first of all be about scientific achievements. In 2018 there was distinct, albeit rather expected, progress. Indeed, there were plenty of reasonably novel improvements, and indeed the major theme was a challenging one: how to grow a mind, a true AI – but that was it. Quite enough for many, but a bit too classical for others. Furthermore, the AI influence on our everyday life has already achieved a much greater impact than generally anticipated in public opinion: every day AI makes around 100 trillion decisions, meaning it is thoroughly embedded in our society. Coupled with other ICT achievements, human society long ago developed into an information society – an integration of humans and ICT systems, and an integration of human society and technology.
This is one of the reasons why nobody understands what is actually going on – social scientists understand society, while engineers and technical scientists understand technological systems, and in the end nobody understands the two embedded and integrated into one unity – a kind of Borg arrangement, except that the unifying essence is the web, ICT and AI services. Another analogy is related to computer chess – when humans play based on their own brains, the inferiority and inability to understand complex relations are evident. Only coupled with powerful computers and advanced AI programs can we hope to decipher the societal changes and trends, and propose good solutions. With regard to novel applications, Tambe's group stood out from many – their security AI, designing daily schedules for airports, harbors and other relevant facilities, is employed at several locations worldwide. It has even been given to 60 wildlife parks to cope better with poachers. That is one of the successful applications, accompanied by a huge mass of new research systems, e.g., a novel HW and SW embedded system connected to the patient's spine that enables a paralyzed patient to stand up. New classes of applications are emerging, e.g., in visual tasks. DNNs can transform a human face into another, or even create a new face never seen before. An animal, say a horse, can be camouflaged into zebra stripes and move freely around in a simulated video. Systems speak perfectly and listen better than humans; they can sharpen a picture or translate from voice online. Google search is using DNNs to capture the best answer to a question. On the other hand, there are seemingly bizarre, simple problems that researchers have a hard time dealing with. While it is generally accepted that DNNs outperform humans in visual classification tasks, it is still a big problem to transfer an ML system based on examples from a specific hospital and specific scanning devices to another. The technical differences are small, causing human experts no problem, but for the DNNs these small details significantly impair the quality. Until recently, that is. At the conference, several solutions related to transfer learning, general AI and also real AI were presented and discussed. Why should AI systems not learn like children, gathering knowledge and learning from there on with a small number of examples, even a single one? There is shallow AI (i.e., current AI), deep AI (also claimed to be shallow), real AI, and fake AI. The last one refers to chatbots, i.e., virtual assistants, where human operators often jump into the communication and leave users under the impression that it is AI on its own. Real AI was one of the major themes of the conference, which is quite a big difference from the previous conferences, where the primary goal was to complete tasks better than expert humans, be it chess or detecting malignant tissue patterns. Now the task is different – perform at the level of children aged a couple of years. While supervised learning clearly achieves top performance, compared to humans it needs far too many examples, which is not acceptable, at least for the slow humans. Similarly, reinforcement learning needs way too many trials. Furthermore, machines do not have common sense, compared even to young children. As for the ban on autonomous weapons, more and more societies and countries are joining it. EurAI, as the union of all European AI societies and the second largest AI society in the world, also supports the ban.
In 2018, the debate on banning autonomous weapons was held in the UN and in the European Parliament: https://www.stopkillerrobots.org/2018/07/parliaments-2/. The list of institutions supporting the ban is here: https://www.stopkillerrobots.org/coalition/.
Figure 8: Tegmark's view of the AI field.
Another interesting approach is the attempt to generate general AI, as discussed at the 2018 IJCAI conference. Currently, the majority opinion among AI researchers is that general AI is possibly 10 years away. It is probably not that the AI community lacks computer power or finances; it is the novel ideas that we are striving for. There is also a reasonable consensus that AI can, could, should and will help humans solve major societal problems. Scientists should avoid politics, especially the discrepancies between different ideological or political tracks, and refrain from attacking colleagues along these lines. Science should be kept as separate as possible from politics and ideology. With these words from Tegmark (Figure 8) we look forward to a bright future for European AI, for AI and for humanity.

References
[1] IJCAI 2018 conference (https://www.ijcai-18.org/).

Special issue on "The Eighth International Symposium on Information and Communication Technology – SoICT 2017"
Since 2010, the Symposium on Information and Communication Technology – SoICT has been organized annually. The symposium provides an academic forum for researchers to share their latest research findings and to identify future challenges in computer science. The best papers from SoICT 2015 and SoICT 2016 were extended and published in the special issues "SoICT 2015" and "SoICT 2016" of Informatica, Vol. 40, No. 2 (2016) and Vol. 41, No. 2 (2017). In 2017, SoICT was held in Nha Trang, Vietnam, during December 7–8. The symposium covered four major areas of research: Artificial Intelligence and Big Data, Information Networks and Communication Systems, Human-Computer Interaction, and Software Engineering and Applied Computing. Among 132 submissions from 22 countries, 64 papers were accepted for presentation at SoICT 2017. Among them, the following six papers were carefully selected, after further extension and additional reviews, for inclusion in this special issue.
The first paper, "Spectrum utilization efficiency of elastic optical networks utilizing coarse granular routing" by Hai-Chau Le and Ngoc T. Dang, investigates an elastic optical network that uses coarse granular routing based on a coarse granular node architecture. The network takes advantage of both elastic optical networking and coarse granular routing technologies to cope with the trade-off between the link cost and the node cost in order to build a spectrum- and cost-efficient solution for future Internet backbone networks. The authors evaluate the hardware scale requirement and the spectrum utilization efficiency of the network with typical modulation formats under various network and traffic conditions.
The second paper, "Time-stamp incremental checkpointing and its application for an optimization of execution model to improve performance of CAPE" by Van Long Tran, Eric Renault, Viet Hai Ha, and Xuan Huyen Do, presents an improvement of Discontinuous Incremental Checkpointing and a new execution model for CAPE using the new checkpointing techniques. It contributes to improving the performance and makes CAPE even more flexible.
The third paper, "SHIOT: A novel SDN-based framework for the heterogeneous Internet of Things" by Hai-Anh Tran, Duc Tran, Linh-Giang Nguyen, Quoc-Trung Ha, Van Tong, and Abdelhamid Mellouk, develops an SDN-based framework called SHIOT, which relies on an ontology for examining end-user requests and applies an SDN controller to classify flow scheduling at the task level.
The fourth paper, "USL: A domain-specific language for precise specification of use cases and its transformations" by Chu Thi Minh Hue, Dang Duc Hanh, Nguyen Ngoc Binh, and Le Minh Duc, introduces a domain-specific language named the Use case Specification Language (USL) to precisely specify use cases. The authors define the abstract syntax of USL using a metamodel together with OCL well-formedness rules and then provide a graphical concrete syntax for usability. The paper also defines precise semantics for USL by mapping USL models to Labelled Transition Systems (LTSs). This opens the possibility of transforming USL models into software artifacts such as test cases and design models.
The fifth paper, "Effective deep multi-source multi-task learning frameworks for smile detection, emotion recognition and gender classification" by Dinh Viet Sang and Tran Bao Cuong, proposes effective multi-task deep learning frameworks that can jointly learn representations for three tasks: smile detection, emotion recognition, and gender classification. The frameworks can be learned from multiple sources of data with different kinds of task-specific class labels.
The sixth paper, "Alignment-free sequence searching over whole genomes using 3D random plot of query DNA sequences" by Da-Young Lee, Hae-Sung Tak, Han-Ho Kim, and Hwan-Gue Cho, proposes a new alignment-free sequence comparison and search method to overcome the limitations of the alignment-based model.
We hope that readers interested in Information and Communication Technology will find this Special Issue a useful collection of papers.
Huynh Thi Thanh Binh
Ichiro Ide

https://doi.org/10.31449/inf.v42i3.2248
Informatica 42 (2018) 293–300

Spectrum Utilization Efficiency of Elastic Optical Networks Utilizing Coarse Granular Routing
Hai-Chau Le and Ngoc T. Dang
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
Computer Communication Labs, The University of Aizu, Aizu-Wakamatsu, Japan
E-mail: chaulh@ptit.edu.vn, ngocdt@ptit.edu.vn

Keywords: elastic optical network, optical cross-connect, spectrum selective switch, routing and spectrum assignment

Received: March 29, 2018

In this paper, we have investigated an elastic optical network that uses coarse granular routing based on our recently developed coarse granular node architecture. The developed coarse granular optical cross-connect (OXC) architecture, which enables routing bandwidth-flexible lightpaths coarse-granularly, is based on coarse granular spectrum selective switches. The network takes advantage of both elastic optical networking and coarse granular routing technologies to cope with the trade-off between the link cost and the node cost in order to build a spectrum- and cost-efficient solution for future Internet backbone networks. We have evaluated the hardware scale requirement and the spectrum utilization efficiency of the network with typical modulation formats under various network and traffic conditions. We also compared the spectrum utilization of our network to that of the corresponding traditional WDM network and a conventional elastic optical network.
Numerical results verified that, similar to a conventional elastic optical network, the proposed network offers substantial spectrum savings compared to a traditional WDM network.
Povzetek: Prispevek uvede izvirno elastično optično omrežje in analizira lastnosti kot učinkovitost.

1 Introduction
The ever-increasing Internet traffic growth has been continuously spurred by newly emerged high-performance and bandwidth-killer applications such as 4k/HD/ultra-HD video, e-Science and cloud/grid computing [1, 2]. To cope with the explosive traffic increase and to support further mobility, flexibility and bandwidth heterogeneity, the need for cost-efficient and bandwidth-abundant flexible optical transport networks has become more and more critical [3, 4]. To scale up to Terabit/s, current optical transport networks based on WDM technology with a fixed ITU-T frequency grid will encounter serious issues due to stranded bandwidth provisioning, inefficient spectral utilization, and high cost [3]. Recent research efforts on optical transmission and networking technologies oriented toward more efficient, flexible, and scalable optical network solutions [4] can be categorized into two different approaches: 1) improving the link resource utilization/flexibility and 2) minimizing the node system scale/cost. The first approach, which aims to enhance the spectrum utilization and the network flexibility, is currently dominated by the development of elastic optical networking technology [5-12]. Elastic optical networks (EONs) realize a spectrum- and energy-efficient optical transport infrastructure by exploiting bitrate-adaptive spectrum resource allocation with a flexible spectrum/frequency grid and distance-adaptive modulation [8, 9]. They are also capable of providing dynamic, spectrum-effective and bandwidth-flexible end-to-end lightpath connections while offering Telcos (IT/communication service providers) the ability to scale their networks economically with the traffic growth and the heterogeneity of bandwidth requirements [10, 11]. However, EON is still facing challenges owing to the lack of architectures and technologies to efficiently support bursty traffic on a flexible spectrum. It also requires more complicated switching systems and more sophisticated network planning and provisioning control schemes [12]. On the other hand, the second approach targets the development of cost-effective, scalable and large-scale optical switching systems [13-18]. One of the most attractive directions is the use of coarse granular optical path (lightpath) switching [16, 17], which can be realized with optical/spectrum selective switching technologies [18]. Spectrum selective switches (SSSs) are available with multiple spectrum granularities, defined as the number of switching spectrum bands. It has been demonstrated that, with a common hardware technology (i.e., MEMS, PLC, LCoS, ...), the hardware scale increases dramatically as finer granular SSSs are applied. Coarser granular SSSs are simpler and more cost-effective, but their routing flexibility is more severely limited. Unfortunately, this routing limitation may seriously affect the network performance, especially in the case of dynamic wavelength path provisioning. In other words, node hardware scale/cost reduction can only be attained at the cost of a certain restriction of routing flexibility. Hence, it is desirable to enhance the node routing flexibility while still keeping the hardware reduction as large as possible.
Based on that, in order to exploit elastic optical networking and coarse granular switching for realizing a cost-efficient, spectrum-effective and flexible optical transport network, we have recently developed a single-layer optical cross-connect architecture based on coarse granular switching spectrum selective switches. An elastic optical network that employs the developed OXC architecture is still capable of exploiting elastic optical networking technology while attaining a substantial hardware reduction. We have also evaluated the network spectrum utilization in various network scenarios, such as single modulation formats (BPSK, QPSK, 8QAM and 16QAM) and a distance-adaptive scheme. Numerical evaluations verified that, like a conventional elastic optical network, the proposed network can obtain significant spectrum savings compared to the corresponding traditional WDM network. A preliminary version of this work, with the proposal and a limited basic numerical evaluation of the effectiveness of a bandwidth-flexible and coarse granular optical cross-connect architecture, was presented at the SoICT conference [19].

2 Elastic optical network utilizing coarse granular routing
2.1 Developed coarse granular routing OXC architecture [19]
Most existing optical cross-connect systems are realized by optical selective switch technology, which is one of the most popular and mature optical switching technologies. For constructing a high-port-count OXC, multiple spectrum selective switches can be cascaded to create a higher-port-count SSS in order to overcome the limitation of the commercially available SSS port count, which is currently 20+ and is unlikely to be substantially enhanced cost-effectively in the near future [4, 18]. Therefore, a larger-scale optical cross-connect system requires more and/or higher-port-count SSSs. Moreover, spectrum selective switches are still costly and complicated devices. The SSS cost/complexity depends strongly on the number of switching spectrum bands per fiber (also called the spectrum granularity). Finer granular SSSs are more complicated, have a greater hardware scale and, consequently, become more expensive. Based on that observation, in order to exploit elastic optical network technology while keeping the hardware scale reasonably small, we have recently developed a coarse granular routing elastic optical cross-connect architecture (denoted as the GRE network) for realizing flexible-bandwidth, large-scale optical transport networks [19]. Figure 1 shows the developed OXC system in which, instead of the fine granular SSSs used in the traditional bandwidth-variable OXC of elastic optical networks, coarse granular spectrum selective switches are implemented to build a cost-efficient, high-port-count OXC system. Unlike traditional WDM networks, which divide the spectrum into individual channels with a fixed channel spacing of either 50 GHz or 100 GHz as specified by ITU-T standards, and unlike elastic optical networks, which employ a flexible frequency grid with a fine granularity (i.e., 12.5 GHz), the developed coarse granular routing elastic optical network employs the same flexible frequency grid but routes lightpaths at the spectrum band level – the so-called coarse granular routing entity (GRE) – through coarse granular OXCs; all spectrum slots of a band must be routed together as a single entity.

2.2 Hardware scale requirement
Practically, the cost and the control complexity of WSS/SSS-based systems depend strongly on the switch scale.
Hence, switch scale minimization plays a key role in creating cost-effective, large-scale WSS/SSS-based OXCs. Among the commercially available optical switch technologies for constructing wavelength selective switch and/or spectrum selective switch systems, MEMS-based systems are among the most popular and widely adopted. Hence, to estimate the effectiveness of our recently developed OXC architecture, for simplicity we consider only MEMS-based spectrum selective switches, whose hardware scale mainly depends on the number of necessary elemental MEMS mirrors. Furthermore, without loss of generality, the add/drop portions, which can be simple 1x2 SSSs or couplers, are neglected. The switch scale of OXC systems is consequently quantified by the total number of MEMS mirrors required by the SSS components. We assume that the transmission bandwidth of a fiber is $C_{fiber}$, the channel spacing based on the ITU-T frequency grid of the traditional WDM network is $G_{WDM}$ ($G_{WDM} = 100$ GHz) and the EON channel spacing is $G_{EON}$ ($G_{EON} \ll G_{WDM}$). The number of wavelengths per fiber, $W_{WDM}$, in the WDM network can be calculated as
$W_{WDM} = C_{fiber} / G_{WDM}$,  (1)
while the number of spectrum slots per fiber, $S$, of the elastic optical network is given by
$S = C_{fiber} / G_{EON}$.  (2)
Let $W$ denote the size of the coarse granular routing entity (i.e., the GRE granularity), that is, the number of spectrum slots per GRE, and let $S$ be the total number of spectrum slots that can be accommodated in a fiber; $1 \le W \le S$, $S$ is divisible by $W$, and $L = S/W$ ($1 \le L \le S$) is the number of switching spectrum bands per fiber. Each mirror of a MEMS-based spectrum selective switch is dedicated to one spectrum slot (or spectrum band), so each SSS needs $L$ MEMS mirrors; note that all spectrum slots of a GRE are switched simultaneously by a single mirror. Hence, the total numbers of MEMS mirrors required in the WDM OXC, the elastic OXC and the proposed GRE architecture are calculated as follows:
$N_{WDM} = n \cdot W_{WDM} \cdot (1 + \lceil (n-1)/M \rceil)$,  (3)
$N_{EON} = n \cdot S \cdot (1 + \lceil (n-1)/M \rceil)$,  (4)
$N_{GRE} = n \cdot L \cdot (1 + \lceil (n-1)/M \rceil)$,  (5)
where $n$ is the number of input/output fibers ($n > 0$), $M$ is the maximal selective switch size (i.e., port count), $W$ is the GRE granularity and $L = S/W$. Table 1 summarizes the switch scale formulas. The formulations also imply that the total number of mirrors needed by an SSS decreases as the applied GRE granularity becomes greater; in other words, applying coarser granular SSSs (SSSs with greater $W$) helps to reduce the switch scale of OXC systems.

OXC architecture | Switching component | Switch scale (total mirror number)
Conventional WDM OXC | WSS | $n \cdot W_{WDM} \cdot (1 + \lceil (n-1)/M \rceil)$
Elastic OXC | SSS | $n \cdot S \cdot (1 + \lceil (n-1)/M \rceil)$
Developed OXC | Coarse granular SSS | $n \cdot L \cdot (1 + \lceil (n-1)/M \rceil)$
Table 1: Switch scale calculation.

Figure 2 shows the switch scale requirement of the developed OXC architecture, in terms of MEMS mirrors, with respect to both the number of input/output fibers (the port count) and the number of switching spectrum bands per fiber. The graph demonstrates that the hardware scale increases as the number of input fibers grows. The increase becomes much more significant when a larger number of switching bands per fiber (finer GRE granularity) is applied. Hence, a great deal of hardware scale/cost reduction can be achieved if the GRE granularity is kept at a reasonable value.
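To make the switch-scale formulas concrete, the short Python sketch below evaluates Equations (1)–(5) for one example configuration. It is only an illustration: the function name, the 4800 GHz fiber band and the chosen values of n, M and W are assumptions made for the example, not parameters taken from the paper.

```python
import math

def mirrors(n, bands_per_fiber, max_port_count):
    """Total MEMS mirrors of an n x n OXC, following Eqs. (3)-(5):
    n SSSs with one mirror per switching band each, plus extra SSS
    stages whenever the fan-out n-1 exceeds the port-count limit M."""
    cascade = 1 + math.ceil((n - 1) / max_port_count)
    return n * bands_per_fiber * cascade

# Assumed example parameters: 4800 GHz usable fiber band, 100 GHz WDM grid,
# 12.5 GHz EON slots, GRE granularity W = 8 slots per band, M = 20, n = 10.
C_fiber, G_wdm, G_eon, W, M, n = 4800, 100, 12.5, 8, 20, 10
W_wdm = int(C_fiber / G_wdm)   # Eq. (1): wavelengths per fiber
S = int(C_fiber / G_eon)       # Eq. (2): spectrum slots per fiber
L = S // W                     # switching spectrum bands per fiber

print("WDM OXC mirrors:    ", mirrors(n, W_wdm, M))  # Eq. (3)
print("Elastic OXC mirrors:", mirrors(n, S, M))      # Eq. (4)
print("GRE OXC mirrors:    ", mirrors(n, L, M))      # Eq. (5)
```

With these assumed numbers the elastic OXC needs W = 8 times as many mirrors as the GRE OXC, consistent with the trend discussed for Figures 2 and 3.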
This implies that a coarse granular routing elastic optical network (using coarse granular SSSs) can be considered a promising solution for creating cost-effective and bandwidth-abundant transport networks. Moreover, Figure 3 shows the hardware scale comparison of the three comparative OXC architectures – traditional OXC, elastic OXC and coarse granular OXC – when the WDM channel spacing is 100 GHz and the spectrum slot of the EON is 12.5 GHz. Due to the use of a large channel spacing, i.e., 100 GHz or 50 GHz, the traditional OXC needs the smallest hardware scale. On the other hand, thanks to the reduction of the number of switching spectrum bands, the coarse granular OXC needs fewer switching elements than the conventional elastic optical cross-connect. Obviously, the hardware scale reduction offered by the coarse granular OXC is enhanced, especially when coarser granular routing is applied (greater GRE granularity).

2.3 Network routing operation
Unlike conventional OXCs in WDM or EON networks, the developed GRE node suffers from an intra-node routing limitation due to the use of coarser granular spectrum selective switches. Figure 4 illustrates the routing principle of the developed coarse granular routing optical cross-connect architecture. In an elastic optical network that uses the developed coarse granular routing node architecture (the so-called coarse granular routing elastic optical network), lightpaths of a spectrum band can be added/dropped flexibly and dynamically by 1x2 SSSs/optical couplers equipped on the incoming and outgoing fibers and by sliceable bandwidth-variable transponders with the spectrum band capacity. Different from conventional elastic optical networks, in which the spectrum slots of each lightpath can be routed separately, in this network the whole set of spectrum slots of a spectrum band from an incoming fiber must be switched together as one entity, due to the coarse granular routing restriction of the spectrum selective switches. This means that all lightpaths assigned to spectrum slots of the same spectrum band have to be routed to a common output fiber. This restriction, imposed by the spectrum band granularity of the SSSs, limits the routing flexibility of the proposed OXC architecture. The node routing flexibility depends on the SSS spectrum granularity. In a coarse granular routing elastic optical network, a finer SSS granularity can be applied to improve the node routing flexibility; however, utilizing finer granular SSSs may cause a rapid increase in hardware scale/cost. Hence, the SSS granularity must be carefully determined while balancing the node routing flexibility against the hardware scale/cost.
Figure 4: Coarse granular routing principle.
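As a small illustration of the routing restriction just described, the following hypothetical Python fragment checks whether a set of slot-to-output assignments on one incoming fiber respects the band-level constraint, i.e., that all occupied slots of a spectrum band are directed to the same output fiber. The data layout and names are assumptions made only for this sketch, not part of the proposed architecture.

```python
def band_routing_violations(slot_to_output, band_size):
    """slot_to_output maps an occupied slot index on an incoming fiber
    to the output fiber requested for it. Returns the indices of bands
    whose occupied slots are directed to more than one output fiber,
    which a coarse granular (GRE) node cannot realize."""
    bands = {}
    for slot, out_fiber in slot_to_output.items():
        bands.setdefault(slot // band_size, set()).add(out_fiber)
    return sorted(b for b, outs in bands.items() if len(outs) > 1)

# Band size W = 4: slots 0-3 form band 0, slots 4-7 form band 1.
demo = {0: "fiber-A", 2: "fiber-A", 5: "fiber-B", 6: "fiber-C"}
print(band_routing_violations(demo, band_size=4))   # -> [1]
```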
On the other hand, similar to conventional elastic optical networks, the coarse granular routing elastic optical network can also support single or multiple modulation formats flexibly and dynamically. Each lightpath can be assigned a pre-determined modulation format (the single modulation format scenario) or an appropriate modulation format according to its distance (the distance-adaptive scenario). In the distance-adaptive scheme, for a given traffic capacity, modulating the optical signal with a higher-order format offers a higher capacity per spectrum slot and consequently requires fewer spectrum slots. This means that applying a higher-order modulation format achieves higher spectrum efficiency, but the optical transparent reach is shortened and, consequently, more frequent regeneration and/or more regeneration resources are required. Conversely, utilizing lower-order modulation formats lowers the spectrum slot capacity and hence may increase the required number of spectrum slots. Hence, the impact of the modulation format assignment scenarios on the network spectrum utilization needs to be clarified.

3 Spectrum usage evaluation
3.1 Theoretical spectrum utilization analysis
In this section, we evaluate the spectrum utilization of three comparative optical networks: the WDM, traditional EON and coarse granular routing elastic optical networks. Without loss of generality, we assumed the following parameters. The channel spacing based on the ITU-T frequency grid of the traditional WDM network is 100 GHz ($G_{WDM} = 100$ GHz, the most popular frequency grid) and the lowest-order modulation format (i.e., BPSK) is applied. The elastic optical network is deployed with a typical channel spacing of 12.5 GHz ($G_{EON} = 12.5$ GHz) and five modulation format assignment scenarios, including four single modulation formats (BPSK, QPSK, 8QAM and 16QAM) and a distance-adaptive scheme.

1) Point-to-point link
In this part, we simply estimate the spectrum utilization of a single point-to-point link with the three comparative technologies: WDM, EON and our coarse granular routing EON (denoted as GRE). We assume that the considered link includes $H_{s,d}$ hops and has a total distance of $D_{s,d}$, where $(s, d)$ is the source and destination node pair of the link, and that the requested bitrate of the connection on the link is $R_{s,d}$ (Gbps). Based on that, let $C_{WDM}$ be the channel capacity of BPSK WDM; the number of spectrum slots needed in the conventional WDM network, $NS_{WDM}(s, d)$, can be calculated as
$NS_{WDM}(s,d) = H_{s,d} \lceil R_{s,d} / C_{WDM} \rceil$.  (6)
Hence, the total WDM spectrum is
$SP_{WDM}(s,d) = G_{WDM} H_{s,d} \lceil R_{s,d} / C_{WDM} \rceil$.  (7)
For the conventional elastic optical network, the number of spectrum slots required in a single modulation format scheme (which uses only one modulation format for the optical signals) is given by
$NS_{EON-MOD}(s,d) = H_{s,d} \lceil R_{s,d} / C_{EON-MOD} \rceil$,  (8)
where MOD denotes the selected modulation format (BPSK, QPSK, 8QAM or 16QAM) and $C_{EON-MOD}$ is the corresponding slot capacity. From Equation (8), the necessary spectrum of a single modulation format elastic optical link can be evaluated as
$SP_{EON-MOD}(s,d) = G_{EON} H_{s,d} \lceil R_{s,d} / C_{EON-MOD} \rceil$.  (9)
Let $\gamma$ be the spectrum grooming ratio ($0 < \gamma \le 1$), $\gamma = x / GRE$, where $GRE$ is the GRE granularity, i.e., the capacity of a coarse granular routing entity, and $x$ is the average number of spectrum slots that carry traffic in a coarse granular routing entity. Consequently, the number of spectrum slots and the corresponding total spectrum required for a coarse granular routing EON link are respectively calculated as
$NS_{GRE-MOD}(s,d) = \frac{1}{\gamma} H_{s,d} \lceil R_{s,d} / (GRE \cdot C_{EON-MOD}) \rceil$  (10)
and
$SP_{GRE-MOD}(s,d) = \frac{GRE \cdot G_{EON}}{\gamma} H_{s,d} \lceil R_{s,d} / (GRE \cdot C_{EON-MOD}) \rceil$.  (11)
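As a quick numerical illustration of Equations (6)–(11), the Python sketch below computes the spectrum needed by one link under WDM, a single-modulation-format EON and the GRE network. The channel and slot capacities, the GRE granularity and the grooming ratio are assumed values chosen only for the example, not figures from the paper.

```python
import math

def sp_wdm(rate, hops, g_wdm=100.0, c_wdm=40.0):
    """Eqs. (6)-(7): WDM spectrum (GHz), assuming a 40 Gbps BPSK
    channel capacity on a 100 GHz grid."""
    return g_wdm * hops * math.ceil(rate / c_wdm)

def sp_eon_mod(rate, hops, c_mod, g_eon=12.5):
    """Eqs. (8)-(9): single-modulation-format EON spectrum (GHz)."""
    return g_eon * hops * math.ceil(rate / c_mod)

def sp_gre_mod(rate, hops, c_mod, gre=4, gamma=0.8, g_eon=12.5):
    """Eqs. (10)-(11): GRE spectrum (GHz); whole bands of `gre` slots
    are switched together and `gamma` is the grooming ratio."""
    return (gre * g_eon / gamma) * hops * math.ceil(rate / (gre * c_mod))

# Example link: R_sd = 200 Gbps, H_sd = 3 hops, QPSK slot capacity 25 Gbps.
print(sp_wdm(200, 3))                  # conventional WDM
print(sp_eon_mod(200, 3, c_mod=25.0))  # conventional EON, QPSK
print(sp_gre_mod(200, 3, c_mod=25.0))  # coarse granular routing EON, QPSK
```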
On the other hand, for the distance-adaptive scheme of both the conventional EON and our GRE network, the modulation format of each lightpath is determined individually and assigned dynamically according to the total distance of the lightpath. Therefore, if we assume the simplest modulation format assignment strategy, which assigns the highest possible order of modulation format, the total spectrum slot numbers required by the distance-adaptive scheme of the EON and coarse granular routing EON networks are
$NS_{EON-DIST}(s,d) = \begin{cases} H_{s,d} \lceil R_{s,d} / C_{EON-16QAM} \rceil & \text{if } D_{s,d} \le d_{16QAM} \\ H_{s,d} \lceil R_{s,d} / C_{EON-8QAM} \rceil & \text{if } d_{16QAM} < D_{s,d} \le d_{8QAM} \\ H_{s,d} \lceil R_{s,d} / C_{EON-QPSK} \rceil & \text{if } d_{8QAM} < D_{s,d} \le d_{QPSK} \\ H_{s,d} \lceil R_{s,d} / C_{EON-BPSK} \rceil & \text{otherwise} \end{cases}$  (12)
and
$NS_{GRE-DIST}(s,d) = \begin{cases} \frac{1}{\gamma} H_{s,d} \lceil R_{s,d} / (GRE \cdot C_{EON-16QAM}) \rceil & \text{if } D_{s,d} \le d_{16QAM} \\ \frac{1}{\gamma} H_{s,d} \lceil R_{s,d} / (GRE \cdot C_{EON-8QAM}) \rceil & \text{if } d_{16QAM} < D_{s,d} \le d_{8QAM} \\ \frac{1}{\gamma} H_{s,d} \lceil R_{s,d} / (GRE \cdot C_{EON-QPSK}) \rceil & \text{if } d_{8QAM} < D_{s,d} \le d_{QPSK} \\ \frac{1}{\gamma} H_{s,d} \lceil R_{s,d} / (GRE \cdot C_{EON-BPSK}) \rceil & \text{otherwise,} \end{cases}$  (13)
where $d_{MOD}$ denotes the transparent reach of modulation format MOD. From Equations (12) and (13), the required spectrum of the elastic optical link and that of the coarse granular routing EON link are estimated, respectively, by
$SP_{EON-DIST}(s,d) = G_{EON} \, NS_{EON-DIST}(s,d)$  (14)
and
$SP_{GRE-DIST}(s,d) = GRE \cdot G_{EON} \, NS_{GRE-DIST}(s,d)$.  (15)

2) Spectrum utilization of the network
Consider a network topology $G = \{V, E\}$, in which $V$ is the set of nodes, $|V| = n$, and $E$ is the set of links. For each node pair $(s, d) \in V \times V$, we assume that the traffic load requested from the source node $s$ to the destination node $d$ is $R_{s,d}$, and that the hop count and the distance of the route connecting $s$ and $d$ are $H_{s,d}$ and $D_{s,d}$, respectively. Based on the calculations given in Equations (7) and (9), the total spectrum required in the conventional WDM network is
$SP_{WDM} = \sum_{(s,d) \in V \times V} G_{WDM} H_{s,d} \lceil R_{s,d} / C_{WDM} \rceil$,  (16)
and the spectrum utilization of elastic optical networks for the single modulation format scheme is given by
$SP_{EON-MOD} = \sum_{(s,d) \in V \times V} G_{EON} H_{s,d} \lceil R_{s,d} / C_{EON-MOD} \rceil$.  (17)
Similarly, from Equation (11), the total spectrum utilization of the coarse granular routing elastic optical network for the single modulation format scheme is
$SP_{GRE-MOD} = \sum_{(s,d) \in V \times V} \frac{GRE \cdot G_{EON}}{\gamma} H_{s,d} \lceil R_{s,d} / (GRE \cdot C_{EON-MOD}) \rceil$.  (18)
Moreover, in the distance-adaptive scheme, elastic optical networks, including both the conventional network and our developed network, are able to assign the modulation format dynamically. In fact, there are many different modulation assignment strategies, e.g., shortest path first (or least spectrum), least regeneration resources, etc. Depending on the applied strategy, the proportions of the available modulation formats that are actually used can vary. If we assume that $\alpha$, $\beta$, $\delta$ and $\epsilon$ are coefficients that determine the distribution of the selected modulation formats (BPSK, QPSK, 8QAM and 16QAM, respectively) in the network, with $\alpha \ge 0$, $\beta \ge 0$, $\delta \ge 0$, $\epsilon \ge 0$ and $\alpha + \beta + \delta + \epsilon = 1$, then, based on Equations (17) and (18), the required spectrum of the distance-adaptive conventional elastic optical network and that of the coarse granular routing EON network can be calculated as
$SP_{EON-DIST} = \alpha \, SP_{EON-BPSK} + \beta \, SP_{EON-QPSK} + \delta \, SP_{EON-8QAM} + \epsilon \, SP_{EON-16QAM}$  (19)
and
$SP_{GRE-DIST} = \alpha \, SP_{GRE-BPSK} + \beta \, SP_{GRE-QPSK} + \delta \, SP_{GRE-8QAM} + \epsilon \, SP_{GRE-16QAM}$.  (20)
This means that the performance of the distance-adaptive networks lies between those of the single modulation format elastic networks.
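Building on the per-link helpers sketched above, the following Python fragment applies the distance-adaptive rule of Equations (12)–(15) to each node pair and sums the result over a small demand set, in the spirit of the network-level totals (16)–(20). The slot capacities, transparent reaches, GRE granularity and grooming ratio are again assumed values used only for illustration.

```python
import math

G_EON = 12.5   # GHz per spectrum slot (flexible grid)
GRE = 4        # assumed GRE granularity (slots per band)
GAMMA = 0.8    # assumed grooming ratio (0 < gamma <= 1)
# Assumed slot capacity (Gbps) and transparent reach (km) per format.
FORMATS = [("16QAM", 50.0, 500), ("8QAM", 37.5, 1000),
           ("QPSK", 25.0, 2000), ("BPSK", 12.5, 4000)]

def slot_capacity(distance_km):
    """Distance-adaptive rule of Eqs. (12)-(13): pick the highest-order
    format whose transparent reach covers the lightpath distance."""
    for _name, cap, reach in FORMATS:
        if distance_km <= reach:
            return cap
    return FORMATS[-1][1]   # fall back to BPSK

def sp_eon_dist(rate, hops, dist):   # Eqs. (12), (14): conventional EON
    return G_EON * hops * math.ceil(rate / slot_capacity(dist))

def sp_gre_dist(rate, hops, dist):   # Eqs. (13), (15): GRE network
    cap = slot_capacity(dist)
    return (GRE * G_EON / GAMMA) * hops * math.ceil(rate / (GRE * cap))

# Illustrative demand set: (R_sd in Gbps, H_sd hops, D_sd km) per node pair.
demands = [(100, 3, 900), (400, 5, 1800), (250, 2, 400)]
print("EON total spectrum (GHz):", sum(sp_eon_dist(*d) for d in demands))
print("GRE total spectrum (GHz):", sum(sp_gre_dist(*d) for d in demands))
```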
From Equations (16)–(20), the length of the lightpaths, in terms of both hop count and distance, significantly affects the spectrum usage; the longer the paths, the more spectrum is required. It should therefore be minimized to optimize the resource usage in elastic optical networks. In other words, the shortest paths should be used for the lightpaths. Note, however, that simply implementing the shortest paths may result in substantial spectrum collisions.

3.2 Numerical results and discussion
To estimate the performance efficiency of the developed coarse granular routing elastic optical network, we utilized the following parameters for the numerical evaluations. The frequency grid of the WDM network is 100 GHz and the spectrum slot bandwidth of the EON and GRE networks is 12.5 GHz. The two tested network topologies are the pan-European optical transport network, COST266, and the US backbone network, USNET (see Figure 5). The traffic load is represented by the total traffic demand requested between node pairs, which is assigned randomly according to a uniform distribution in the range from 50 Gbps to 500 Gbps (for each traffic load, 100 samples were tested and the average values were plotted). In the numerical experiments, we have also assumed that the comparative elastic optical networks provide four typical modulation formats: BPSK, QPSK, 8QAM and 16QAM. Consequently, there are five experimental network scenarios: four single modulation format schemes (BPSK, QPSK, 8QAM and 16QAM) and a distance-adaptive scheme. The coarse granular switching group capacity, GRE (the number of spectrum slots per group), is set as a variable; we tested GRE granularities of 2, 4 and 8 (when GRE = 1, the GRE network is equivalent to a conventional EON). The obtained results of the corresponding WDM network are used as a benchmark (its graph is always 1); all obtained results for the EON and GRE networks are compared to those of the corresponding WDM network, and the relative data are shown.
Figure 5: Tested network topologies.
Figure 6: Spectrum utilization comparison for the distance-adaptive scheme.
Firstly, the spectrum usage comparison for the distance-adaptive scheme of the three comparative networks in the COST266 and USNET topologies is illustrated in Figure 6. The obtained results show that both the GRE network and the conventional elastic optical network offer significant spectrum savings compared to the traditional WDM network; up to 65% (45%) spectrum reduction can be achieved for the COST266 (USNET) topology with a traffic load of 500 Gbps, thanks to the deployment of the flexible grid and dynamic modulation format assignment. The results also demonstrate that the relative spectrum utilization of the EON and GRE networks tends to decrease slightly as the traffic load becomes greater or as finer granular routing is applied (smaller GRE granularity). That is because a large traffic load can fill up the huge channel spacing used in conventional WDM networks, and thus using a finer frequency grid does not help much to reduce the spectrum utilization. Note that, in this distance-adaptive scheme, the spectrum utilization savings are less than those for the 16QAM single modulation format scheme, due to the possibility of applying lower-order modulation formats to cope with the distance of the required traffic without using any regeneration resources.
Moreover, to verify the impact of flexible modulation format assignment on the network performance, we compared five different network scenarios, including four single modulation format schemes (BPSK, QPSK, 8QAM and 16QAM) and the distance-adaptive scheme, with a traffic load of 100 Gbps. The comparative results are described in Figure 7. They demonstrate that employing higher-order modulation formats offers better spectrum savings. Even though the developed GRE network can reduce the hardware scale, the spectrum utilization of our network (with GRE = 4) is higher than that of the EON, due to the routing flexibility limitation. This also implies the importance of flexible modulation format assignment in saving spectrum while dealing with the trade-off between the node routing flexibility (node cost) and the link resource requirement.
a) COST266 b) USNET
Figure 8: Dependence of the network spectrum usage on the GRE granularity.
Finally, Figure 8 shows the dependence of the spectrum utilization on the applied GRE granularity when the traffic load is fixed at 100 Gbps and 250 Gbps. Again, it is shown that finer granular routing (smaller GRE granularity) offers better network performance in terms of spectrum utilization, especially for small traffic loads. The reason is that a small traffic load may not fill up the whole spectrum band switched in the GRE network. Finer granular routing is expected to reduce the spectrum utilization; however, it may result in an explosive increase in the hardware scale. Hence, from the network point of view, it is desirable to balance the spectrum usage and the hardware scale requirements.

4 Conclusion
We have introduced a coarse granular routing elastic optical network that employs the developed coarse granular spectrum selective switch-based optical cross-connect architecture. By imposing coarse granular spectrum selective switching, the developed network is still able to exploit elastic optical networking technology while attaining a significant hardware reduction. To evaluate the performance of the coarse granular routing elastic optical network, we have clarified its spectrum utilization in various network scenarios, single modulation format (including BPSK, QPSK, 8QAM and 16QAM) and distance-adaptive schemes, under different traffic conditions. We also compared the spectrum utilization of the network to that of the corresponding traditional WDM network and a conventional elastic optical network. Numerical results verified that, similar to a conventional elastic optical network, the proposed network offers substantial spectrum savings, up to 65%, compared to the traditional WDM network. The developed network provides a promising solution to deal with the trade-off between node cost and link cost for creating cost-effective and spectrum-efficient future Internet backbone networks.

5 Acknowledgment
This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.02-2015.39.

6 References
[1] Cisco Visual Networking Index: Forecast and Methodology, Cisco Systems, 2014–2019. Retrieved from http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.pdf
[2] E. B. Desurvire (2006). Capacity demand and technology challenges for lightwave systems in the next two decades. Journal of Lightwave Technology, IEEE, vol. 24, No. 12, pp. 4697-4710. https://doi.org/10.1109/JLT.2006.885772
[3] J. Berthold, A. Saleh, L. Blair, J. Simmons (2008).
Optical networking: Past, present, and future. Journal of Lightwave Technology, IEEE, vol. 26, No. 9, pp. 1104-1118. https://doi.org/10.1109/JLT.2008.923609 [4] K. Sato, H. Hasegawa (2009). Optical Networking Technologies That Will Create Future Bandwidth-Abundant Networks. Journal of Optical Communications and Networking, IEEE/OSA, vol. 1, no. 2, pp. A81-A93. https://doi.org/10.1364/JOCN.1.000A81 [5] A. Jukan and J. Mambretti (2012). Evolution of Optical Networking Toward Rich Digital Media Services. Proceedings of the IEEE, IEEE, vol. 100, no. 4, pp. 855-871. https://doi.org/10.1109/JPROC.2011.2182076 [6] G. Bosco, V. Curri, A. Carena, P. Poggiolini, and F. Forghieri (2011). On the performance of Nyquist-WDM terabit superchannels based on PM-BPSK, PM-QPSK, PM-8QAM or PM-16QAM subcarriers. Journal of Lightwave Technology, IEEE, vol. 29, No.1, pp. 53–61. https://doi.org/10.1109/JLT.2010.2091254 [7] G. Zhang, M. De Leenheer, A. Morea and B. Mukherjee (2013). A Survey on OFDM-Based Elastic Core Optical Networking. IEEE Communications Surveys & Tutorials, IEEE, vol. 15, no. 1, pp. 65-87. https://doi.org/10.1109/SURV.2012.010912.00123 [8] M. Jinno, H. Takara, B. Kozicki, Y. Tsukishima, Y. Sone, and S. Matsuoka (2009). Spectrum-Efficient and Scalable Elastic Optical Path Network: Architecture, Benefits, and Enabling Technologies. IEEE Communications Magazine, IEEE, vol. 47, pp. 66-73. https://doi.org/10.1109/MCOM.2009.5307468 [9] O. Gerstel, M. Jinno, A. Lord and S. J. B. Yoo (2012). Elastic optical networking: a new dawn for the optical layer?. IEEE Communications Magazine, IEEE, vol. 50, no. 2, pp. s12-s20. https://doi.org/10.1109/MCOM.2012.6146481 [10] A. Lord, P. Wright and A. Mitra (2015). Core Networks in the Flexgrid Era. Journal of Lightwave Technology, IEEE, vol. 33, no. 5, pp.1126-1135. https://doi.org/10.1109/JLT.2015.2396685 [11] M. Jinno, B. Kozicki, H. Takara, A. Watanabe, Y. Sone, T. Tanaka and A. Hirano (2010). Distance-adaptive spectrum resource allocation in spectrum-sliced elastic optical path network. IEEE Communications Magazine, IEEE, vol. 48, no. 8, pp.138-145. https://doi.org/10.1109/MCOM.2010.5534599 [12] B. Chatterjee, N. Sarma and E. Oki (2015). Routing and Spectrum Allocation in Elastic Optical Networks: A Tutorial. IEEE Communications Surveys & Tutorials, IEEE, vol. PP, no. 99, pp. 1. https://doi.org/10.1109/COMST.2015.2431731 [13] T. Zami, D. Chiaroni (2012). Low contention and high resilience to partial failure for colorless and directionless OXC. Proceedings of Photonics in Switching, OSA, paper Fr-S25-O15. Retrieved from https://ieeexplore.ieee.org/document/6608250 [14] I. Kim, P. Palacharla, X. Wang, D. Bihon, M. D. Feuer, S. L. Woodward (2012). Performance of Colorless, Non-directional ROADMs with Modular Client-side Fiber Cross-connects. Proceedings of Optical Fiber Communication Conference (OFC2012), OSA, paper NM3F.7. https://doi.org/10.1364/NFOEC.2012.NM3F.7 [15] Y. Li, L. Gao, G. Shen, L. Peng (2012). Impact of ROADM colorless, directionless and contentionless (CDC) features on optical network performance. Journal of Optical Communication and Networking, IEEE, vol. 4, No. 11, pp. B58-B67. https://doi.org/10.1364/JOCN.4.000B58 H.-C. Le et al. [16] H.-C. Le, H. Hasegawa, K. Sato (2014). Performance evaluation of large-scale multi-stage hetero-granular optical cross-connects. Optics Express, OSA, vol. 22, no. 3, pp. 3157-3168. https://doi.org/10.1364/OE.22.003157 [17] Y. Taniguchi, Y. Yamada, H. Hasegawa, and K. Sato (2012). 
A novel optical networking scheme utilizing coarse granular optical routing and fine granular add/drop. Proceedings of OFC/NFOEC, OSA, pp. JW2A.2. https://doi.org/10.1364/NFOEC.2012.JW2A.2 [18] R. Hirako, K. Ishii, H. Hasegawa, K. Sato, H. Takahashi, M. Okuno (2011). Development of Single PLC-Chip Waveband Selective Switch that Has Extra Ports for Grooming and Termination. Proceedings of the 16th Opto-Electronics and Communications Conference, IEEE, pp. 492-493. Retrieved from https://ieeexplore.ieee.org/document/6015223 [19] Hai-Chau Le, Thanh Long Mai, Ngoc T. Dang (2017). Spectrum Utilization of Coarse Granular Routing Elastic Optical Networks. Proceedings of SoICT’17: Eighth International Symposium on Information and Communication Technology, pp. 197-203. https://doi.org/10.1145/3155133.3155180 Time-stamp Incremental Checkpointing and itsApplicationfor an Optimizationof Execution ModeltoImprovePerformanceof CAPE Van LongTran Samovar, Télécom SudParis, CNRS, UniversitéParis-Saclay -9rue CharlesFourier, Évry, France E-mail: van_long.tran@telecom-sudparis.eu and www.telecom-sudparis.eu Hue Industrial College -70 Nguyen Hue street, Hue city,Vietnam E-mail: tvlong@hueic.edu.vn and www.hueic.edu.vn Éric Renault Samovar, Télécom SudParis, CNRS, UniversitéParis-Saclay -9rue CharlesFourier, Évry, France E-mail: eric.renault@telecom-sudparis.eu and www.telecom-sudparis.eu Viet Hai Ha Collegeof Education, Hue University -34Le Loi street, Hue city,Vietnam E-mail: haviethai@gmail.com and www.dhsphue.edu.vn Xuan Huyen Do Collegeof Sciences, Hue University -77 Nguyen Hue street, Hue city,Vietnam E-mail: doxuanhuyen@gmail.com and www.husc.edu.vn Keywords: OpenMP, OpenMP on cluster, CAPE, Checkpointing-AidedParallelExecution, Checkpointing, Incremental checkpointing, DICKPT, TICKPT Received: March 29, 2018 CAPE, which standsfor Checkpointing-AidedParallelExecution,isa checkpoint-based approachto au­tomatically translate and execute OpenMP programs on distributed-memory architectures. This approach demonstrates high-performance and complete compatibility with OpenMP on distributed-memory systems. In CAPE, checkpointingis oneof the mainfactors acted on the performanceof the system. Thisis shown overtwoversionsofCAPE.Thefrstversionbasedoncomplete checkpointsistooslowascomparedtothe second version based on Discontinuous Incremental Checkpointing. This paper presents an improvement of Discontinuous Incremental Checkpointing, and a new execution model for CAPE using new techniques of checkpointing. It contributes to improve the performance and make CAPE even more fexible. Povzetek: Predstavljena je izboljšava CAPE -paralelno izvajanje, usmerjeno s podporo redundance. 1 Introduction In order to minimize programmers’ diffculties when de­veloping parallel applications, a parallel programming tool at a higher level should be as easy-to-use as possible. MPI [1], which stands for MessagePassing Interface, and OpenMP [2] are two widely-used tools that meet this re­quirement. MPI is a tool for high-performance computing on distributed-memory environments, while OpenMP has been developed for shared-memory architectures. If MPI is quitediffcultto usefor non programmers, OpenMPisvery easy to use, requesting the programmer to tag the pieces of code to be executed in parallel. Some efforts have been made to port OpenMP on distributed-memory architectures. However,apart from our solution, no solution successfully met the two following requirements: 1) to be fully compliant with the OpenMP standard and 2) high performance. 
Most prominent ap­proaches include the use of an SSI [3], SCASH [4], the use of the RC model [5], performing a source-to-source translation to a tool like MPI [6, 7] or Global Array [8], or Cluster OpenMP [9]. Among all these solutions, the use of a Single Sys­tem Image (SSI) is the most straightforward approach. An SSI includes a Distributed Shared Memory (DSM) to provide an abstracted shared-memory view over a physi­cal distributed-memory architecture. The main advantage of this approach is its ability to easily provide a fully-compliant version of OpenMP. Thanks to their shared-memory nature, OpenMP programs can easily be com­piled and run as processes on different computers in an SSI. However, as the shared memory is accessed through the network, the synchronization between the memories in­volves an important overhead which makes this approach hardly scalable.Someexperiments[3]haveshownthatthe larger the number of threads, the lower the performance. As a result, in order to reduce the execution time over­head involved by the use of an SSI, other approaches have been proposed.Forexample, SCASH only maps the shared variables of the processes onto a shared-memory area at­tached to each process, the other variables being stored in a private memory, and the RC model uses the relaxed consistency memory model. However, these approaches have diffculties to identify the shared variables automat­ically. As a result, no fully-compliant implementation of OpenMP based on these approaches has been released so far. Some other approaches aim at performing a source-to-source translation of the OpenMP code into a MPI code. This approach allows the generation of high-performance codes on distributed-memory architectures. However, not all OpenMP directives and constructs can be implemented. As yet another alternative, Cluster OpenMP, proposed by Intel, also requires the use of additional directives of its own (ie. not included in the OpenMP standard). Thus, this one cannot be considered as a fully-compliant implemen­tation of the OpenMP standard either. Concerning to bypass these limitations, we developed CAPE[10,15] which standsfor Checkpointing-AidedPar­allel Execution. CAPE is a solution that provides a set of prototypes and frameworks to automatically translate OpenMP programs for distributed memory architectures and make them ready for execution. The main idea of this solution is using incremental checkpoint techniques (ICKPT) [11, 12] to distribute the parallel jobs and their data to other processes (the fork phase of OpenMP), and collect the results after the execution of the jobs from all processors (the join phase of OpenMP). ICKPT is also used to deal with the exchange of shared data automatically. Although CAPE is still under development, it has shown its abilitytoprovideaveryeffcient solution.Forexample, a comparison with MPI showed that CAPE is able to reach up to 90% of the MPI performance [13, 14]. This has to be balanced with thefact that CAPE for OpenMP requires the introduction of few pragma directives only in the se­quential code, i.e. no complex code from the user point of view, while writing a MPI code might require the user to completely refactorise the code. Moreover, as compared to other OpenMP for distributed-memory solutions, CAPE is fully compatible with OpenMP [13, 15]. 
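As a rough illustration of this difference in programming effort (the fragment below is not taken from the paper, only a minimal example of the kind of code CAPE targets), parallelizing a loop with OpenMP amounts to tagging it with a single directive, whereas an MPI version of the same computation would require explicit data distribution and communication:

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    /* initialize the input arrays sequentially */
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* the only change to the sequential code: one OpenMP directive */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    return 0;
}

With CAPE, such directives are kept in the sequential source and the translator produces the distributed-memory version automatically, whereas an MPI port of the same loop would have to be rewritten around explicit send/receive calls.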
This paper presents an improvement of DICKPT – a checkpoint technique for CAPE, and a new execution model applied these new checkpoints, that improves the performance and the fexibility of CAPE.A part of these results were also presented and published at the SoICT’s 2017 conference [16]. The paper is organized as follows: the next section describes CAPE mechanism, capabilities and restrictions in details. Section 3 presents a develop­ment of checkpointing that are used in CAPE. Then, Sec­tion 4 presents the design and the implementation of the newexecution model based on the newcheckpointing tech­niques. The analysis and evaluation of both new check-pointing and execution model are presented in Section 5. Section4showsthe resultoftheexperimental analysis.Fi­nally,Section5concludes the paper and presents our future works. 2 CAPE principles In order to execute an OpenMP program on distributed-memory systems,CAPE usesasetof templatesto translate an OpenMP source code into a CAPE source code. Then, the generated CAPE source code is compiled using a tra­ditional C/C++ compiler. At last, the binary code can be executedindependently on anydistributed-memory system supporting the CAPE framework. The different steps of the CAPE compilation process for C/C++ OpenMP programs is shown in the Figure 1. Figure1:Translationof OpenMP programs with CAPE. 2.1 Execution model The CAPE execution model is based on checkpoints that implement the OpenMP fork-join model. This mecha­nism is shown in Figure 2. To execute a CAPE code on a distributed-memory architecture, the program frst runs on asetofnodes,eachnodebeingrunasa process. Whenever the program meets a parallel section, the master node dis­tributes the jobs among the slave processes using the Dis­continuous Incremental Checkpoints (DICKPT) [12, 13] mechanism. Through this approach, the master node gen­erates DICKPTs and sends them to the slave nodes, each slave node receives a single checkpoint. After sending checkpoints, the master node waits for the results to be re­turned from the slaves. The next step is different depending upon the nature of the node: the slave nodes receive their checkpoint, inject it into their memory, execute their part of the job, and sent back the result to the master node by using DICKPT;the master nodewaits for the results and af­ter receiving them all, merges them before injection into its memory. At the end of the parallel region, the master sends the resulting checkpoint to every slaves to synchronize the memory space of the whole program. 2.2 Translation from OpenMP to CAPE In the CAPE framework, a set of functions has been defned and implemented to perform the tasks devoted to DICKPT, typically, distributing checkpoints, send­ing/receiving checkpoints, extracting/injecting a check­point from/to the program’s memory, etc. Besides, a set of templates has been defned in the CAPE compiler to perform the translation of the OpenMP program into the CAPE program automatically and make it executable in the CAPE framework. Sofar, nested loops and shared-data variable constructs are not supported yet. However, this is not regarded as an issue as this can be solved at the level Figure 2: CAPE execution model. of the source-to-source translation and does not require any modifcations in the CAPE philosophy. In this end, CAPE can only be applied to OpenMP programs matching the Bernstein’s conditions [17]. After the translations operated by the CAPE compiler, the OpenMP source code is free of anyOpenMP directives and structures. 
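Before looking at the concrete substitution in Figure 3, the fork-join exchange of Figure 2 can be made more concrete with the following hedged sketch, written in terms of the framework functions described just below (create, send, wait_for, inject, broadcast, receive). The prototypes, file names, MASTER constant and the wrapping function are illustrative assumptions and not actual CAPE-generated code.

/* Hedged sketch of the fork-join exchange of Figure 2; the extern
   declarations stand for the CAPE framework functions. */
extern void create(const char *file);       /* generate a checkpoint           */
extern void inject(const char *file);       /* inject a checkpoint into memory */
extern void send(const char *file, int node);
extern void receive(const char *file);
extern void wait_for(const char *file);     /* wait for checkpoints and merge  */
extern void broadcast(const char *file);    /* send a checkpoint to all slaves */

#define MASTER 0

void parallel_region(int rank, int nb_slaves) {
    if (rank == MASTER) {
        for (int s = 1; s <= nb_slaves; s++) {
            create("part.ckpt");            /* one DICKPT per slave            */
            send("part.ckpt", s);
        }
        wait_for("result.ckpt");            /* gather and merge the slave deltas */
        inject("result.ckpt");              /* update the master's memory        */
        broadcast("result.ckpt");           /* synchronize every slave           */
    } else {
        receive("part.ckpt");
        inject("part.ckpt");                /* resume at the assigned job        */
        /* ... the slave executes its share of the parallel region here ...     */
        create("delta.ckpt");               /* checkpoint holding the results    */
        send("delta.ckpt", MASTER);
        receive("result.ckpt");             /* final memory synchronization      */
        inject("result.ckpt");
    }
}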
Figure3presentsanexampleof code sub­stitution for the specifc case of the parallel for con­struct. This example is typical of those we implemented for the other constructs [7]. The automatically generated code is based on the following functions that are part of the CAPE framework: – start( ) sets up the environment for the genera­tion of DICKPTs. – stop( ) restores the environment used for the gen­eration of DICKPT. – create(file) generates a checkpoint with name file. – inject(file) injects a checkpoint into the mem­ory of the current process. – send(file, node) sends a checkpoint from the current process to another. – wait_for(file) waits for checkpoints and merges them to create another one. – merge(file1,file2) merges two checkpoints together. Figure 3: Template for the parallel for with incre­mental checkpoints. – broadcast(file) sends a checkpoint to all slave nodes. – receive(file) waits for and receives a check­point. 2.3 Discontinuous incremental checkpointing on CAPE Checkpointing is the technique that saves the images of a process at a point during its lifetime, and allows it to be resumed at the saving’s time if necessary [11, 18]. Using checkpointing, processes can resume theirexecution froma checkpoint statewhenafailure occurs.So,noneedtotake timeto initializeandexecuteitfromthebegin. Thesetech­niques are introduced since two decades ago. Nowadays, theyare researched and used widely onfault-tolerance, ap­plications trace/debugging, roll-back/animated playback, and process migration. Basically, checkpointing techniques can be categorized into two groups: completed checkpoints and incremental checkpoints. Completed checkpointing [18, 19, 20] saves all information regarding the process at the points that it generate checkpoints. The advantages of this technique is reducing the time of generation and restoration. However, the checkpoint’s sizeistoo large. Incremental checkpoint-ing [11, 21, 22, 23, 12, 24] only saves the modifed infor­mation as compared to the previous checkpoint. This tech­nique reaches advantages of reducing checkpoint’s over­head and checkpoint’s size, so it is in widely used in dis­tributed computing. Besides, using data compression to re­duce checkpoint’s size [11, 21, 24], it is also focus on the techniques that detect modifed databut reach the minimum of size. Some typical techniques are using page-based pro­tection to identify the pages in memory that havebeen mod­ifed [11, 22, 23], usingword-level granularity [21, 12], us­ing block encoding [22], using user-directed and memory exclusion [11], using live variable analysis [24]. Figure 4: Principle of DICKPT in cases of checkpointing. In CAPE, Discontinuous Incremental Checkpointing (DICKPT) is a development based on incremental check-pointing, that contains two kinds of data, register infor­mation and modifed data of the process. In which, the frst one is copied from all register data of the process, and the second one is identifed based on write-protection tech­niques. Figure 4 shows the steps to monitor and generate a checkpoint of a process on CAPE. It is done by an other process making use of the ptrace Unix system call. The idea of these steps is that, at the beginning of the paral­lel region, the monitor sets all page of monitor process at write-protected. Whenever the monitored process wants to write into anywrite-protected page,a SIGSEGV signalis generated. Then, the monitor saves the data of this page, re­moves the write-protection and lets the monitored process write into the page. 
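The write-protection step just described can be pictured with the following minimal sketch. It is an in-process simplification for illustration only: CAPE itself performs the monitoring from a separate process through ptrace, and the page size, region layout and helper names used here are assumptions rather than CAPE code.

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE  4096
#define PAGES 256

static char *region;                   /* memory monitored during the region      */
static char  saved[PAGES * PAGE];      /* pre-modification copies of dirty pages   */
static int   dirty[PAGES];             /* which pages have been written            */

/* First write to a protected page raises SIGSEGV: save the page content,
   then remove the protection so the monitored code can carry on writing.
   (A real implementation would avoid non-async-safe calls in the handler.) */
static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1);
    size_t idx = (page - (uintptr_t)region) / PAGE;
    memcpy(saved + idx * PAGE, (void *)page, PAGE);
    dirty[idx] = 1;
    mprotect((void *)page, PAGE, PROT_READ | PROT_WRITE);
}

void start_monitoring(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    /* map the monitored region read-only so that every first write faults */
    region = mmap(NULL, PAGES * PAGE, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}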
At the end of the region, the monitor compares the saved data with the current data of monitored process page. The difference are extracted and saved into checkpoint fle. 2.4 Remarks The good performance of CAPE as compared to those of MPI and the full compliance to the OpenMP specifca­tions [13, 15, 14] have made CAPE a good alternative to port OpenMP on distributed-memory architectures. Sofar, the implementation of CAPE is not complete, some disad­vantages can be listed: 1. DICKPT saves all modifed data of process, including temporary and private variables. This is an unneces­sary synchronization inside an OpenMP program. 2. As shown in Figure 2, the master node might act as a bottleneck while waiting for checkpoints from the salves, merging checkpoints and/or sending back data to slaves for memory synchronization. 3. To distribute jobs to slaves, the master node gener­ates a number of checkpoints that depends upon the number of slave nodes and so that each slave node re­ceives a checkpoint (see Figure 7). This method can reach a high-level of optimization. However, it might notbe enoughfexiblefor some caseslike1)the num­ber of slaves may not be identifed at compile time, 2) the OpenMP source code should be modifed to de­tect when the master generates the checkpoint and 3) the dynamic scheduling of OpenMP cannot be imple­mented using this method. 4. After distributing the jobs, the slave nodesexecute the divided jobs while the master does nothing until the reception of the resulting checkpoints from the slaves, which clearly wastes resources. 5. For synchronization, the checkpoints should be sent by order in order to resume exactly the last state of process. 3 Time-stamp incremental checkpointing (TICKPT) Time-stamp Incremental Checkpointing (TICKPT) is an improvement of DICKPT by adding new factor – time-stamp – into incremental checkpoints andby removing un­necessary data based on data-sharing variable attributes of OpenMP program. Basically, TICKPT contains three mandatory elements including register’s information, modifed region in mem­ory of the process, and their time-stamp. As well as DICKPT, in TICKPT, the register’s information are ex­tracted from all registers of the process in the system. How­ever, the time-stamp is added to identify the order of the checkpoints in the program. This contributes to reduce the time for merging checkpoints and selecting the right ele­ment if located at the same place in memory. In addition, only the modifed data of shared variables are detected and saved into checkpoints. It makes checkpoint’s size signif­icantly reduced depending on the size of private variables of the OpenMP program. To present the order of checkpoints in a program, time­stamps have to represent the order of the instructions when it is executed. For the general case, an activation tree [25] can be used to identify the sequence of function call in a program. For CAPE, checkpoints are always generated in same level of functions, so that the program counter can be used to ensure simplicity. However, if the instruction is a loop, the program counter is combined with the loop iteration to represent the order of the loop exactly. To detect modifed data, the write-protection mechanism is used. However, only the shared variables are written down in the checkpoint fle. The matter in here is how to detect private and shared variables. static local variables allocated in heap and data seg­ments of the process’s memory, and local variables allo­cated on the stack (see Figure 5). 
The variables in heap and data segments can easily be identifed by their ad­dress. For the variables on thestack, we save the stack pointer before entering the #pragma omp parallel region. Variables declared before the stack pointer are shared. The others, are private. To explicitly, change the status of a variable, the pro­grammer can use data-sharing attributes like OpenMP di­rective #pragma omp threadprivate (list of variables) and relative clauses. The OpenMP data-sharing clauses are showninTable1. Clauses Description default(none|shared) Specifying the default behavior of variables shared(list) Specifying the list of shared variables private(list) Specifying the list of private variables frstprivate(list) Allowing to access value of the list of private variables in the frst time lastprivate(list) Allowing to share value of the list of private variables at the end of parallel region copyin(list) Allowing to access value of threadprivate variables copyprivate(list) Specifying the list of private variables that should be shared among all threads. reduction(list, ops) Specifying the list of variables that are subject to a reduction operation at the end of the par­allel region. Table 1: OpenMP data-sharing clauses. 4 Anew execution modelfor CAPE In order to improve the performance of CAPE and its fexi­bility, we designed a new execution model that extends the one presented in Section 2.1. In this new execution model, DICKPT is replaced by TICKPT. Figure 6 illustrates the model which can be described as follows: Figure 5: Allocation of OpenMP program’s variables in virtual process memory. In an OpenMP program, data-sharing variable attributes can be set up either, implicitly or explicitly [2]. All vari­ables declared outside an #pragma omp parallel di­rective are implicitly shared. This includes all global and 1. At the beginning of the program, all nodes in the sys­tem execute the same sequential code. 2. When a parallel region is reached, the master process creates a set of incremental checkpoints. The number of incremental checkpoints depends upon the num­ber of tasks in the parallel region. Each incremen­tal checkpoint contains the state of the program to be Figure 6: The new execution model for CAPE. used to resume its execution in another process at the saved time. 3. The master process scatters the set of incremental checkpoints. Each node receives some of the check­points generated by the master process. This step is illustrated in the Figure 8. 4. The received incremental checkpoints are injected into the slave processes’ memories. 5. The slave processes resume their execution. 6. Results on slave processes are extracted by identify­ing the modifed regions and saved as an incremental checkpoint. 7. Incremental checkpoints of each process is sent back to the master node. Incremental checkpoints are com­bined altogether to generate a single checkpoint. This step can be distributed among the processes if need be. 8. The fnal combined incremental checkpoint is injected in the master process’ memory and the master process can resume its execution. Changing the execution model implies changing the translation templates. Figure9presents the template for the #pragma omp parallel for directivethat adapts to the new execution model. The other OpenMP directives canbe designedina similarway.For this template, CAPE operates as follows: Figure 8: Scheduling method with the new execution model. 
– generate_dickpt(before_i) (line 3): at each loop iteration, the master process generates an incremental checkpoint.
– scatter(before, &recv_n, master) (line 4): the master process scatters the checkpoints to the available processes, including itself. Each process receives some of the checkpoints (recv_n).
– inject(recv_n) (line 5): each checkpoint is injected into the target process' memory.
– the execution is resumed on instruction D (line 6).
– generate_dickpt(after_n) (line 7): each process generates an incremental checkpoint that saves the result of its execution.
– allreduce(after_n, &after, [<ops>]) (line 8): the after_n checkpoint of process n is sent to the other processes. The checkpoints are combined, computed and saved into a new after checkpoint. With TICKPT, the order of the checkpoints is carried inside each of them, so this step can be performed using the Recursive Doubling algorithm [26] as illustrated in Figure 10.
– inject(after) (line 9): the after checkpoint is injected into each process' memory, so that every process resumes with a synchronized memory state.

#define N 1000 ... int A[N], B[N], C[N], i; ... #pragma omp parallel for private(A,B) shared(C) for(i = 0; i

Whenever the delay of the obtained path is greater than Δdelay, the value of λ is increased in order to increase the influence of the delay factor in the cost function (see Eq. 3). The relationship between the parameter λ, the cost and the delay of a given path can be illustrated by the following lemmas.

Lemma 1. If 0 ≤ λ1 ≤ λ2 then D(r_λ1) ≥ D(r_λ2) and C(r_λ1) ≤ C(r_λ2). Proof. See [9].

Lemma 1 shows that a larger λ will lead to a larger cost and a smaller delay. This implies that, as long as the resulting shortest path does not violate the predefined delta delay, a smaller λ will definitely result in a better solution. The next lemma is used to find the smallest λ value (i.e., the λ related to the shortest path that does not violate Δdelay).

Lemma 2. If λ1 < λ2, D(r_λ1) ≠ D(r_λ2) and λ0 = (C(r_λ1) − C(r_λ2)) / (D(r_λ2) − D(r_λ1)), then C(r_λ1) ≤ C(r_λ0) ≤ C(r_λ2) and D(r_λ1) ≥ D(r_λ0) ≥ D(r_λ2). Proof. See [9].

Lemma 2 shows that, with λ0 = (C(r_λ1) − C(r_λ2)) / (D(r_λ2) − D(r_λ1)), the shortest path r_λ0 must have a delay between the delays of r_λ1 and r_λ2, and a cost between the costs of these two paths. The two lemmas imply that the least-cost path rc has to be computed first, using Dijkstra's algorithm w.r.t. the cost. If its delay is not greater than Δdelay, then it must be the optimal solution. Otherwise, the least-delay path rd, i.e., the path found by using Dijkstra's algorithm w.r.t. the delay, should be obtained. If its delay is greater than Δdelay, no optimal solution can be achieved. If none of the above conditions holds, the algorithm begins an iterative procedure. In each iteration, rd is updated with a better solution having a lower delay and rc is updated with a better solution having a lower cost.

Algorithm 1 Lagrange relaxation-based routing algorithm
Require: source, dest, C, D, Δdelay
Ensure: Optimal path
1: rc ← Dijkstra(source, dest, C)
2: if D(rc) ≤ Δdelay then return rc
3: end if
4: rd ← Dijkstra(source, dest, D)
5: if D(rd) > Δdelay then return "No solution"
6: end if
7: while true do
8:   λ := (C(rc) − C(rd)) / (D(rd) − D(rc))
9:   P ← Dijkstra(source, dest, C_λ)
10:  if C_λ(P) = C_λ(rc) then return rd
11:  else if D(P) ≤ Δdelay then
12:    rd ← P
13:  else
14:    rc ← P
15:  end if
16: end while

The heuristic algorithm (see Algorithm 1) is described in detail as follows. First, we utilize the original cost function (Eq. 1) and find the least-cost path using Dijkstra's algorithm. If the delay of this path meets the delay requirement Δdelay, it is the optimal path.
Otherwise, we fnd the least delay path and examine whether the delay of this path is greater than .delay. We may then decide to start the loop or to stop the algorithm as there is no optimal solution that can be found. The . parameter is computed as C(rc) - C(rd) . := D(rd) - D(rc) Such parameter is updated after each iteration. The Di­jkstra’s algorithm is used w.r.t the new value of cost C. . If the cost value of the new path is found to be equal to the cost value of the least cost path, the optimal path should be the least delay path. If not, if the delay of the new path is found to be smaller than .delay, the least delay path is updated as the new path. Otherwise, the least cost path is considered as the new path. The loop is repeated until the optimal path is found. 4 Experiments 4.1 Experimental setup In order to conduct the experiments, we implemented a testbed, which is illustrated in the Fig. 5. The user is able to send requests to the E-healthcare system using Restful APIs from anyplatforms and devices. In the present work, we developed an Android application that automatically creates and sends the RESTful requests. A load balanc­ing mechanism with PC coordinator is also implemented to support high rate of requests. Concretely, the coordina­tor applies the Round Robin algorithm to distribute the re­quests to three Request Analysis PCs (RA1-3). These three RA PCs analyze the incoming requests and search for the appropriate outputs in the Ontology DB database. In the SDN controller, these outputs are then fed to the routing algorithm that determines the appropriate way to control the simulated network. In this paper, we utilize mininet [6] to emulate the network topology, which involves a num­ber of different nodes representing the Openfow-enabled switches and IoT devices. Figure 5: The testbed, implemented with load balancing mechanism 4.2 Analyzing the request analysis layer First, we assess the scalability of the Request Analysis layer by varying the number of sending requests per sec­ond and evaluating the round trip time (RTT). In Fig. 6, it is obvious that the load balancing mechanism is able to supporthigh request rate.It achievesaRTTof2.2 seconds when the rate reaches 500 requests per second, while the RTT without load balancing mechanism is 26.8 seconds. This in turn, proves the scalability of the Request Analysis layer. In ordertoevaluatethe accuracyofthe Request Analysis layer, we execute 10000 requests, and send them to fve applications (2000 requests per application). Figure 6: RTT corresponding to the Request Analysis layer, which is deployed with and without load balancing mechanism. Applications Number of user requests Number of well-classifed requests Prop. Monitoring 2000 1928 96.4 % Therapeutic 2000 1987 99.3 % Fitness and Wellness 2000 1979 98.9 % Behavioral 2000 1893 94.6 % Rehabilitation 2000 1912 95.6 % Table 1: Accuracyrelated to the classifcation of user re­quests in the fve applications Table1shows that majority of user requests have been classifedexactlyforallfveapplications.Thefaultsmostly come from the Behavioral application. This is due to the fact that the Request Analysis layer is based on the pro­cessing of text strings. The Behavioral application on the other hand, includes a variety of activity and emotion de­scriptions such asfall, sleep,exercise, anxiety, stress, de­pression, etc. Hence, it is more diffcult to classify the user tasks. However, an accuracy of 94.6% is still acceptable for this kind of application. 
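Before turning to the routing-layer results, the iterative part of Algorithm 1 can be summarized by the short sketch below. The helpers dijkstra_cost, dijkstra_delay, dijkstra_combined, no_path and the path_t structure are assumed names used only for illustration, not functions of the actual implementation, and the sketch assumes the aggregated cost C_λ takes the usual form C + λ·D (Eq. 3 itself is not reproduced here).

/* Hedged sketch of the LARE iteration (Algorithm 1). All helpers are assumed. */
typedef struct { int hops[64]; int len; double cost; double delay; } path_t;

extern path_t dijkstra_cost(int src, int dst);               /* minimizes C          */
extern path_t dijkstra_delay(int src, int dst);              /* minimizes D          */
extern path_t dijkstra_combined(int src, int dst, double l); /* minimizes C + l * D  */
extern path_t no_path(void);

path_t lare(int src, int dst, double delta_delay) {
    path_t rc = dijkstra_cost(src, dst);
    if (rc.delay <= delta_delay) return rc;        /* least-cost path is feasible    */
    path_t rd = dijkstra_delay(src, dst);
    if (rd.delay > delta_delay) return no_path();  /* no feasible path at all        */
    for (;;) {
        double lambda = (rc.cost - rd.cost) / (rd.delay - rc.delay);
        path_t p = dijkstra_combined(src, dst, lambda);
        double clp  = p.cost  + lambda * p.delay;  /* aggregated cost C_lambda       */
        double clrc = rc.cost + lambda * rc.delay;
        if (clp == clrc) return rd;                /* equality test as in Algorithm 1;
                                                      a tolerance would be used in practice */
        if (p.delay <= delta_delay) rd = p;        /* better feasible path            */
        else                        rc = p;        /* better lower bound on the cost  */
    }
}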
4.3 Analyzing the routing layer Concerning the performance of the routing layer,we imple­mented several other methods that also focus on the DCLC problem. These methods are as follows. – The Constrained Bellman-Ford(CBF)routing algo­rithm [29], which is based on a breadth-frst search that is able to update the lowest cost path in each vis­ited node. CBF runs until either the highest constraint is exceeded or it cannot improve the paths anymore. – The Multi-Criteria Routing algorithm(MCR)devel­oped by Lee et al. in [16]: This routing algorithm is based on heuristics of ranked metrics in the network. A loop is repeated to determine the shortest path for each metric until the best path is found, or itfails for all metrics. – The routing algorithm proposed by Cheng et al. [4] that combines the problems of fnding the least cost and least delay paths by modifying the cost function. Such algorithm is abbreviated as MCF in the present work. It aims to compute a simple metric from mul­tiple requirements using the weighted combination of the various QoS metrics. The above mentioned algorithms are compared with the proposed algorithm (abbreviated as LARE in this paper) using the following measures: – Number of fails, which is the average number of un­reachable nodes, which occurs when the path cannot be found at a given delta delay .delay. This measure aims at assessing the effciency of a given algorithm in fnding the destination nodes. – Delay, which is the average delay time of the path from a source node to a reachable destination node. – Cost, which is considered as the average cost of the paths from a source node to all reachable destination nodes. Concerning the network topology, we frst validate the routing algorithms and their functionality with a simple network including 17 nodes. We then carried out the ex­periments with NTT, a 37 node network topology [7] that is modeled using the exact characteristics of a real-world network. Finally, we construct by hand a huge network with 150 nodes (abbreviated as 150N topology) toevaluate the scalability of the proposed framework. To obtain the performance related to the various rout­ing algorithms, this work varies the value of delta delay (.delay)from 100 ms to 1000 ms in both the NTT and 150N topologies. As it can be seen from Figs. 7 and 8, when .delay is smaller than 200 ms, the number of un­reachable nodes is considerably high. This number de­creases as the delay constraint is increased. All the paths are only found after 400 ms. Figure 7: Average number of unreachable nodes in NTT topology Figs.9and10showtheaveragedelay curvesofthevari­ous algorithms in NTT and 150N topologies. As illustrated, with the low values of delta delay (ranging from 100 ms to Figure 8: Average number of unreachable nodes in 150N topology 300 ms), the number of reachable nodes is small. The des­tination nodes are close to the source nodes, keeping the delay values at very low level. After 400 ms delta delay, when all the paths are found, the average delay becomes stable. Figure 9: Average delay of the various routing algorithms in NTT topology It is clear that the LARE algorithm provides the best re­sult in terms of average delay in both the network topolo­gies. The scalability of LARE has been proven in 150N (see Fig. 10), a large network having extremely high node density, where the performancegap between the proposed algorithm and other methods become more signifcant. 
At 1000 ms delta delay, LARE gives an average delay of 456 ms, while those of other methods are more than 660 ms. The MCF algorithm provides the highest average delay in both cases due to its diffculty to select an appropriate ag­gregate weights when combining the QoS metrics. Figs. 11 and 12 shows the average cost of the paths foundbythevarious algorithmsintwo network topologies. Asexplained above,at thebeginning(.delay < 300ms)it is impossible to fnd paths to all destination nodes. Since the obtained paths are very short, the costs become rela­Figure10:Averagedelayofthevariousrouting algorithms in 150N topology tively small. The algorithms fnd more paths as we in­crease the delta delay. When all destination nodes are found(.delay . 300ms), a higher delta delay would re­sult in a lower cost. It is obvious that LARE algorithm gives almost similar outcomes to MCF in both the network topologies. Thisisexactlywhatweexpected becauseMCF focuses on optimizing the cost value. MCR is the worst performer, since it only fnds the best path for one met­ric, while ignoring the cost value. Especially, in the 150N topology (Fig. 12), LARE achieves an average cost of 19.5 at the delta delay of 1000 ms, while the MCR produces an average cost of 30.1. 4.4 Analyzing the overall SHIOT framework This section aims to validate the capability of supporting stressed network of the proposed framework and compared it with the traditional system, which relies on the sim­ple best-effort policy and is implemented without SDN. Specifcally, we try to stress the system by gradually in­ Figure 13: Capability of supporting stressed network of SHIOTand the traditional system in terms of delay. We set the delta delay to the value of 1000 ms to en­sure that the paths to all the destination nodes are found. As shown in Figs. 13 and 14, SHIOT is obviously better than the traditional couterpart in terms of delay and cost. Even in the most stressed state (i.e., 2000 requests per sec­ond), SHIOTis able to provide an average delay of 694 ms and a cost of 33, while those, achieved by the traditional system are 935ms and 69, respectively. The performance difference maybe due to the simple policy(the best-effort policy) that is implemented in the core network of the tra­ditional system. SHIOT, on the other hand, possess a lay­ered architecture that is able to deal with the high rate of requests, sent from the various applications. Finally, we evaluate the system performance while run­ning the video application. Specifcally, video fows are generated (video streaming) using an open source software, named VLC. Such fows are sent from one node to another in the simulated network. The experiment lasts fve min­utes. The delay and jitter are computed for each chosen path. InTable2, we can see that SHIOToutperforms the conventional system. Its average end-to-end delay is about 3.5 s while that of the traditional system is 7.3 s. Simi­lar observation can be obtained in terms of jitter. This is exactly what we expected because SHIOT has the ability to differentiate the different types of data fow (e.g. video, audio, information data). 5 Conclusions Inthispaper,wehave proposedalayeredSDNframework, named SHIOT, to address the heterogeneity issue in the IoT. SHIOT is based on an open ontology to classify the incoming user requests. This framework also utilizes the Lagrange relaxation theory to fnd the optimal path in order to forward these requests to the destination nodes. 
In gen­eral, SHIOT can be considered as a remedy to bridge the gap between abstract high-level tasks and other low-level networks/devices. Experimental results showed that SH­IOT yielded better performance when compared with the traditional system, that is deployed without SDN. It is also proved to be effcient and effective in handling tasks re­quired by the various applications. Concerning the future works, we are in the process of evaluating SHIOT, which is implemented in the real devices (e.g. Openfow-enabled switches). 6 Acknowledgments This work is supported by the Vietnamese Ministry of Science and Technology (MOST) under the Grant No. NDT.14.TW/16. References [1] The foodlight open sdn controller project. http://www.projectfoodlight.org/foodlight/. [2] M. Al-Ayyoub,Y. Jararweh,E. Benkhelifa,M.Vouk, A. Rindos, et al. Sdsecurity: A software de­fned security experimental framework. In Com­munication Workshop (ICCW), 2015 IEEE Inter­ Time (s) SHIOT Traditional system delay (s) jitter (s) delay (s) jitter (s) 20 7.3 0.93 8.3 0.9 40 9.2 0.84 8.7 0.7 60 9.1 0.52 9.3 0.8 80 7.7 0.68 8.5 0.6 100 4.3 0.34 7.6 0.62 120 5.2 0.26 7.2 0.57 140 3.5 0.33 6.8 0.61 160 4.5 0.41 7.1 0.63 180 3.8 0.27 7.4 0.61 200 4.7 0.43 7.6 0.58 220 3.9 0.39 7.2 0.62 240 3.5 0.52 7.3 0.63 260 3.2 0.45 7.8 0.60 280 3.3 0.42 7.2 0.62 300 3.5 0.33 7.1 0.59 Table 2: Comparison of SHIOTand the traditional system, while running video streaming application national Conference on, pages 1871–1876. IEEE, 2015. DOI: https://doi.org/10.1109/ ICCW.2015.7247453. [3] S. Chakrabarty, D. W. Engels, and S. Thathapudi. Black sdn for the internet of things. In Mo­bile Ad Hoc and Sensor Systems (MASS), 2015 IEEE 12th International Conference on, pages 190– 198. IEEE, 2015. DOI: https://doi.org/10. 1109/MASS.2015.100. [4] S. Chen and K. Nahrstedt. On fnding multi-constrained paths. In Communications, 1998. ICC 98. Conference Record. 1998 IEEE International Conference on, volume 2, pages 874–879. IEEE, 1998. DOI: https://doi.org/10.1109/ ICC.1998.685137. [5] A. Darabseh,M. Al-Ayyoub,Y. Jararweh,E. Benkhe­lifa, M. Vouk, and A. Rindos. Sdstorage: a software defned storage experimental framework. In Cloud Engineering (IC2E), 2015 IEEE Inter­national Conference on, pages 341–346. IEEE, 2015. DOI: https://doi.org/10.1109/ IC2E.2015.60. [6] R. L. S. de Oliveira, A. A. Shinoda, C. M. Schweitzer, and L. R. Prete. Using mininet for em­ulation and prototyping software-defned networks. In Communications and Computing (COLCOM), 2014 IEEE Colombian Conference on, pages 1– 6. IEEE, 2014. DOI: https://doi.org/10. 1109/ColComCon.2014.6860404. [7] G. Di Caro, F. Ducatelle, and L. Gambardella. An-tHocNet: an adaptive nature-inspired algorithm for routing in mobile ad hoc networks. European Transactions on Telecommunications, 16(5):443– 455, 2005. DOI: https://doi.org/10.1002/ ett.1062. [8] A. Doria, J. H. Salim, R. Haas, H. Khosravi, W.Wang, L. Dong, R. Gopal, and J. Halpern. For­warding and control element separation (forces) pro­tocol specifcation. Technical report, 2010. DOI: https://doi.org/10.17487/RFC3746. [9] G. Feng and C. Doulgeris. Fast algorithms for delay constrained leastcost unicast routing. shortened ver­sion presented on INFORMS, 2001. [10] M. R. Gary and D. S. Johnson. Computers and intractability: A guide to the theory of np-completeness, 1979. [11] J. Gubbi,R. Buyya,S. Marusic,andM.Palaniswami. Internet of things (iot): A vision, architec­tural elements, and future directions. Future Generation Computer Systems, 29(7):1645–1660, 2013. 
DOI: https://doi.org/10.1016/j. future.2013.01.010. [12] F. Hu, Q. Hao, and K. Bao. A survey on software-defned network and openfow: from concept to im­plementation. IEEE Communications Surveys&Tu­torials, 16(4):2181–2206, 2014. DOI: https:// doi.org/10.1109/COMST.2014.2326417. [13] H. Huang, J. Zhu, and L. Zhang. An sdn_based man­agement framework for iot devices. In Irish Signals &Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Com­munications Technologies (ISSC 2014/CIICT 2014). 25th IET, pages 175–179. IET, 2013. DOI: http: //dx.doi.org/10.1049/cp.2014.0680. [14] R. ITU-T. P. 862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speechquality assessment of narrow-band telephone networks and speechcodecs, 2001. [15]Y. Jararweh,M. Al-Ayyoub,E. Benkhelifa,M.Vouk, A. Rindos, et al. Sdiot: a software defned based internet of things framework. Journal of Ambient Intelligence and Humanized Computing, 6(4):453– 461, 2015. DOI: https://doi.org/10.1007/ s12652-015-0290-y. [16] W. Lee and M. Hluchyj. Multi-criteria routing sub­ject to resource and performance constraints. In ATM Forum, volume 94, page 0280, 1994. [17]J.Li,E. Altman,andC.Touati.Ageneral sdn-based iot framework withnvf implementation. ZTE commu­nications, pages 1–11, 2015. [18]P.Martinez-JuliaandA.F. Skarmeta. Extendingthe internet of things to ipv6 with software defned net-working.Technical report, Euchina-fre, 2014. [19] N. McKeown, T. Anderson, H. Balakrishnan, G.Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. Openfow: enabling innovation in cam­pus networks. ACM SIGCOMM Computer Commu­nication Review, 38(2):69–74, 2008. DOI: https: //doi.org/10.1145/1355734.1355746. [20] F. Olivier, G. Carlos, and N. Florent. New security architecture for iot network. Procedia Computer Sci­ence, 52:1028–1033, 2015. [21] N. Omnes, M. Bouillon, G. Fromentoux, and O. Le Grand. A programmable and virtualized net­work & it infrastructure for the internet of things: Howcan nfv&sdn help forfacing the upcoming chal­lenges. In Intelligence in Next Generation Networks (ICIN), 2015 18th International Conference on,pages 64–69. IEEE, 2015. DOI: https://doi.org/ 10.1109/ICIN.2015.7073808. [22] I. Rec. Y. 1541: Network performance objectives for ip-based services. InternationalTelecommunication Union, ITU-T, 2003. [23] I. Recommendation. 910,“subjectivevideo quality as­sessment methods for multimedia applications,” rec­ommendation itu-t p. 910. ITUTelecom. Standardiza­tion Sector of ITU, 1999. [24] D. S. Reeves and H. F. Salama. A dis­tributed algorithm for delay-constrained unicast rout­ing. IEEE/ACMTransactions on Networking(TON), 8(2):239–250, 2000. DOI: https://doi.org/ 10.1109/90.842145. [25] W. Ren, Y. Sun, T.-Y. Wu, and M. S. Obaidat. A hash-based distributed storage strategy of fowtables in sdn-iot networks. In GLOBECOM 2017-2017 IEEE Global Communications Conference, pages 1– 7. IEEE, 2017. DOI: https://doi.org/10. 1109/GLOCOM.2017.8254507. [26]P.K. Sharma,S. Singh,Y.-S. Jeong, andJ.H.Park. Distblocknet:Adistributed blockchains-based secure sdn architecture for iot networks. IEEE Communica­tions Magazine, 55(9):78–85, 2017. DOI: https: //doi.org/10.1109/MCOM.2017.1700041. [27] V. R. Tadinada. Software defned networking: Re­defning the future of internet in iot and cloud era. In Future Internet of Things and Cloud (Fi-Cloud), 2014 International Conference on, pages 296–301. IEEE, 2014. DOI: https://doi.org/ 10.1109/FiCloud.2014.53. [28] R. Vilalta, A. 
Mayoral, D. Pubill, R. Casellas, R. Martínez, J. Serra, C.Verikoukis, and R. Muz. End-to-end sdn orchestration of iot services using an sdn/nfv-enabled edge node. In OpticalFiber Commu­nication Conference,pages W2A–42. Optical Society of America, 2016. DOI: https://doi.org/10. 1364/OFC.2016.W2A.42. [29] R.Widyono et al. The design and evaluation of rout­ing algorithms for real-time channels. International Computer Science Institute Berkeley, 1994. USL:ADomain-Specifc LanguageforPrecise Specifcationof Use Cases and ItsTransformations Chu Thi Minh Hue1, Dang Duc Hanh2, Nguyen Ngoc Binh3 and Le Minh Duc4 Departmentof Software Engineering, VNU Universityof Engineering andTechnology 1HungYen UniversityofTechnology and Education,Vietnam 2Corresponding author 3Visiting professor, Hosei University, Japan 4Hanoi University,Vietnam E-mail: {huectm.di12 | hanhdd | nnbinh}@vnu.edu.vn, duclm@hanu.edu.vn Keywords: use cases, pre-and postcondition, labelled transition systems, model transformations, domain-specifc lan­guages Received: March 29, 2018 A use case model is often represented by a UML use case diagram and loosely structured textual de­scriptions. The use case model expressed in such a form contains ambiguous and imprecise parts. This prevents integrating it into model-driven approaches, where use case models are often taken as the source of transformations. In this paper, we introduce a domain-specifc language named the Use case Speci­fcation Language (USL) to precisely specify use cases. We defne the abstract syntax of USL using a metamodel together with OCL wellformedness rules and then provide a graphical concrete syntax for the usability goal.We also defnea precise semantics for USLby mapping USL models to LabelledTransition Systems(LTSs).Itopensa possibilityto transformUSLmodelstosoftwareartifactssuchastest casesand design models.Wefocusona transformationfromaUSLmodeltoa template-basedusecase description inorderto illustrateourmethod.AlanguageevaluationofUSLisalsoperformedinthispaper. Povzetek: Zasnovanjedomenskospecifˇcno specifkacijo primerovin transformacij. cni jezik USL za natanˇ 1 Introduction like [3, 4, 9, 5, 6, 7], which used natural language descrip­ tion, the works in [10, 11] proposed a formal semantics for Use case is a software artifact that is commonly used for use case. On the other hand, UML activity and sequence di-capturing and structuring the functional requirements. A agrams are proposed in [12, 13, 14, 15] to model the control usecaseis defnedas“the specifcationof sequencesofac-fowsinuse case.Anumberofotherworks[4,16,17]pro­tions, includingvariant sequencesand error sequences,that posed usinga domain specifc language (DSL)to specify a system, subsystem, or class can perform by interacting use case. DSL [18] is a language that is designed specif-with outside objects to provide a service of value” [1]. As ically for a certain domain to ease the task of describing a requirements artifact, the use case model is commonly concepts in the domain. specifed by a UML use case diagram and loosely struc- However, the main limitation of the existing work is that tured textual descriptions [2]. A key beneft of this use they do not focus on precisely capturing the relevant use case specifcation is that it is easy for non-technical stake-case information. These include control fows, steps, sys-holders to learn and use. However, the use case models ex-tem actions, actor actions, and constraints on the use case pressed in this form often contain ambiguous and imprecise and its fows. 
In this paper, we propose a DSL named Use parts. This prevents the models from being used directly Case Specifcation Language (USL) to overcome this lim-in model-driven approaches, as a transformation source to itation. The goal of USL is to precisely specify use cases produce other analysis and design models. An important and its model transformation abilities. The USL’s domain challenge here is how to achieve a balance between two consists in the task of specifying use cases that capture the seemingly conficting goals: to specify use case suffciently system behavior. precise for model transformation purposes, while achieving the ease-of-use requiredby non-technical stakeholders. Our approach is to defne the abstract syntax of USL To this end, a considerable number of works, including by extending the metamodels of the UML use case and [3, 4, 5, 6, 7] and those discussed in [8], have attempted to activity diagrams [2]. Our extension consists in a set of introduce rigor into use case description. More specifcally, meta-concepts needed for the following purposes: (1) to T.Yueetal.[3] proposedaddingkeywordsand restriction describethe elementsofatypicalusecase descriptiontem-rules into use case descriptions and then using natural lan-plate; (2) to represent the basic and alternate fows ofa use guage processing techniquesin orderto analyze them. Un-caseintheformof sequential, branched, repeating steps,or concurrent steps; (3) to categorize steps and actions based on the interaction subjects, which include the system, ac­tors and included/extending use cases;and (4) to represent constraints on the use case, actions, and fows. Our precise specifcation of USL makes it possible to automatically transform USL models into other software artifacts using model transformation techniques. In brief, the main contributions of our work are as follows: – ADSL namedUSLto precisely specify use cases.We defne the abstract syntax of USL using a metamodel constrained by OCL wellformedness rules [19]. For usability, we defne a graphical concrete syntax for USL. – A formal semantics specifcation for USL using La-belled Transition System [20]. This semantics en­ables the automatic transformation of USL models into other software artifacts, such as test cases and class models. – A support tool that includes a visual editor for con­structing USL models. We use this tool and two commonly-used case studies to illustrate our method. WealsoevaluateUSLby comparingitto other related languages. This work makes four signifcant improvements from our earlier conference paper [21]. First, regarding to USL specifcation, we make the abstract syntax precise with the OCL wellformedness rules [19] anddefnea graphical con­crete syntax. Second, we develop an additional case study in order to illustrate how to apply USL in practice. Third, we defne a numqber of typical model transformation sce­nariosforUSL modelandexplain,in more detail,the trans­formation into template-based use case description.Fourth, we provide an evaluation for USL. The restofthis paperisorganizedas follows. Section2 presents the background and anexample for ourwork. Sec­tion3overviews our approach. Section4presents the USL abstract syntax andexplains its formal semantics. Section5 explains how USL models are transformed into other soft­ware artifacts. Section 6 introduces our support tool and illustrates how to apply USL to the ATMsystem case study. This section also presents anevaluationof USL. Section7 comments on the related works. 
The paper is closed with the conclusions and future work. 2 Background and motivation Figure 1 shows a simplifed requirement model of a Library system including a UML use case model de­picted in the part (a) and a UML class diagram capturing corresponding domain concepts of the system which is pre­sented in the part (b). Our paper uses the use case Lend Book in the part (a) as a motivating example. This use case is invoked when the librarian executes the book lending Figure 1: The simplifed use case and the conceptual do­main model of the Library system. Table1:Atypical use case description template Use case name: Lend Book Brief description: The Librarian processes a book loan. Actors: Librarian. Precondition: The librarian has logged into the system successful. Postcondition: If the use case successfully ends, the book loan is saved and a complete message is shown. In the other case, the system displays an error mes­sage. Trigger:The Librarian requests a book-loan process. Special requirement: There is no special requirement. Basic fow 1. The Librarian selects the Lend Book function. 2. The system shows the Lend-book window, gets the current date and assigns it to the book-loan date. 3. The Librarian enters a book copyid. 4. The system checks the book copyid. If it is invalid, it goes to step 4a.1 5. The Librarian enters a borrower id. 6. The system validates the borrower id. If it is invalid, it goes to step 6a.1 7. The Librarian clicks the save-book-loan button. 8. The system validates the conditions to lend book. If it is invalid, the system goes to step 8a.1 9. The system saves the book loan record, then executing two steps 10 and 11 concurrently. 10. The system shows a complete message. 11. The system prints the borrowing bill. Alternate fows E1. request searched book 1. The Librarian selects the search function after step 4a.1. 2. The system executes the extending use case Search book. 4a. The book copyid is invalid 1. The system shows an error message, then going to step 3. 6a. The Borrower id is invalid 1. The system shows an error message, then going to step 5. 8a. The lending condition is invalid 1. The system shows an error message. 2. The system ends the use case. transaction. The use case is represented in a typical tem­plate as showninTable1. A typical use case description template [22] often in­cludes two parts, the overview information elements and the detailed description of fows. The frst part consists of the following elements: the use case name, the use case’s brief description, the actors participating in the use case, the use case’s precondition and postcondition,thetrigger that initiates the use case and the special requirement that describes the non-functional requirements of the use case. The second part contains two types of fows, the basic fow and alternative fows. The basic fow covers what nor­mally happens when the use case is performed. Each use case description has only one basic fow. The alternative fows cover optional or exceptional behaviour as well as the variations of the normal behaviour. Both the basic and alternative fows are often further structured into steps or subfows [23, 1]. Moreover, one can smooth use case fows to contain only a basic fow and some alternate fows. Each step in fows consists of actions performed either by the system or actors.We refer to actors, the system, and other relation use cases as interactive subjects. 
For example, Step 1 in the basic fow is carried out by the Librarian actor, while Step2is performedby the sys-tem.Astep may also contain the information to decide the next moving is another step or another fow or the start-ingor fnishingof concurrent actions.As illustratedinTa­ble 1, Step2 includes three system actions, “The system shows the Lend-book window”, “The system gets the cur­rent date” and “The system assigns the current time to the book-loan date”. Step5containsa branching decision, “If it is invalid, the system goes to step 6a”. Step9 contains the starting point of two concurrent actions: “The system executes two steps 10 and 11 concurrently”. In ourwork, we consider sentences describingexecution of an extending or an included use case as the system’s actions. Our previous work [21] divides use case’s actions into nine types as follows: Actor-Input is an actor action to enter data into the sys­tem, e.g., the action “The Librarian enters a book copyid” at Step3inTable1is an Actor-Input. Actor-Request is an actor action to send requests into the system, e.g., the action “The Librarian clicks the save-book-loan button” at Step 7 in Table 1 is an Actor-Request. System-Display is a system action that the system per­forms operations with the user interface, e.g., the action “The systemshowsthe lend-book window”atStep2inTa­ble1is a System-Display. System-Operation is a system action to validate a re­quest and data, or process and calculate data, e.g., the ac-tion“Thesystemgetsthe currentdate”atStep2inTable1 is a System-Operation. System-State is a system action to query or update its internal states, e.g., the action “The system saves the book loan record”atStep9inTable1isa System-State. System-Output is a system action to send outputs to the actors, e.g., the action “The system shows an error mes­sage” at Step 1 of the alternate fow 4a in Table 1 is a System-Output. System-Request is a system action to send requests to a secondary actor, e.g., the action “The system prints the borrowing bill” at Step 11 shown in Table 1 is a System-Request. System-Include is a system action to include another use case. System-Extend is a system action to extend another use case, e.g., the action “The system executes the extending use case Search book” within Step2of the alternate fow E1 inTable1is a System-Extend. Ause case is successfully executed only if the pre-and postcondition of the use case as well as of the actions of the current fow are satisfed. Within the context of model-driven development, a use case model,as illustratedinFig.1 tendstobetakenasa source model of transformations in order to obtain other software artifacts such as analysis models, design mod­els, and test cases. However, the ambiguous and imprecise parts within use case descriptions prevents us from achiev­ing such transformations. In order to integrate use cases into model-driven approaches, we aim to tackle the follow­ing challenges: Capturing the overview structure. The use case model needs to preserve the overview structure of use case de­scriptions so that a template-based representation of use cases might be generated for non-technical stakeholders. Specifyingprecisely controlfollows. Ause case includes a set of scenarios, each of which corresponds to a control fow of the use case. Therefore, the use case model needs to preserve the information of control fows of use cases. This allows us to automatically generate artifacts like test scenarios and behaviour models. Specifying precisely actions. 
The use case model needs to precisely represent actions within use case scenarios. Aprecise specifcation of actions allows us to capture use case relationships and to generate other artifacts from use cases such as class diagrams, test scenarios, and test ob­jects. Specifying use case constraints. For the aim to automat­ically generate test data, the use case model needs to pre­servethe constraints within use case descriptions, including the pre-and postcondition of use cases, the pre-and post­condition of use case actions, and their guard conditions. 3 Overview of the approach Figure2 illustrates our approach. First, we take as input a use case diagram, the textual descriptions of use cases, and a class diagram capturing the conceptual model of the system. Then, we aim to represent each use case specif­cation as a model element of a so-called use-case domain. In order to defne the use-case domain, we defne meta-concepts w.r.t. the structural elements of the typical use-case-description template and the use case concepts as ex­plained in Sect. 2. The meta-concepts allow us (1) to rep­resent the basic and alternate fows of a use case in form of sequential, branched, or repeating steps, (2) to categorize use case steps and actions based on the interactive subjects including the system and actors, and (3) to represent con­straints on the use case and its fows. Figure 2: Overview of the USL Approach. In order to represent textual descriptions of actions or constraints within a use case specifcation, we consider them as operations on an object-oriented model w.r.t. the input conceptual model of the system. In that way, we could employpairs ofpre-and postcondition as contracts on actions in order to obtain a more precise specifcation of the use case. The constraints are often expressed using constraint languages such as the OCL [24], JML [25], and natural language as mentioned in [17]. In this research, we employthe OCL to represent the constraints. Specifcally, our approach is realized as follows. We propose a domain specifc language named USL in order to represent use cases within the use-case domain. Fur-ther,wedefneaformal semanticsofUSLsothatwecould transform USL models in to other artifacts such as test cases and analysis class models. To illustrate this point, a transformation from a USL model to a template-based use case description willbeexplainedin detailsin SubSect. 5.3. 4 The USL language This section frstexplains the abstract syntax and the graph­ical concrete syntax of USL. Here, we utilize the meta-modeling approach as mentioned in [26] to defne USL. Then, we focus on defning a precise semantics for USL by mapping a USL model to a Labelled Transition Sys-tem(LTS) [20]. 4.1 The USL abstract syntax We defne the USL metamodel w.r.t. the use-case domain based on (1) UML use case specifcation (Chapt. 18 of [2]), (2) the Use Case Descriptions (UCDs) [1], [23], [22], and (3) the UML activity specifcation (Chapt.s 15, 16 of [2]). We will refer to these as thedomain sources (1), (2), and (3), respectively. Figure3shows the metamodel of USL.For brevity, we divide the metamodel into four blocks: (a), (b), (c), and (d). Figure 3-a(i.e., block (a)) presents the top-level con­cepts. Figure 3-b presents the FlowStep hierarchy. Fig­ure 3-c presents the ControlNode hierarchy. Figure 3­d presents the Action hierarchyand how it is related to the FlowStep hierarchy. Figure 3-e presents the con­cept Constraint and how it is used to specify Action, InitialNode, FinalNode, and FlowEdge. 
To conserve space, we will not repeat here the definitions of all of the USL concepts that are described in the three domain sources. We will instead focus on a key subset of the concepts, namely those that will be used later to define the transformation of USL models. Figure 4 presents the USL model of the Lend Book use case shown in Table 1. We will use this example USL model to illustrate our definitions.

Action (domain sources (1, 3)) represents an action that is performed either by an actor or by the system. An Action is characterised by the following attributes: actionName and parameters. The parameters are represented by the concept Parameter, which inherits the concept Parameter of UML (as presented in Sect. 19.9.13 of [2]). Action is specialized into two main types (as illustrated in Fig. 3-d): ActorAction and SystemAction. ActorAction is further specialized into ActorRequest and ActorInput. SystemAction is specialized into SystemOperation, SystemOutput, SystemDisplay, SystemState, SystemRequest, SystemInclude and SystemExtend, which were explained in Sect. 2.

FlowStep (domain source (2)) is a sequence of Actions that represents a step in a basic flow or an alternate flow of the use case. It is characterised by the following attributes: number (the order number of the step), description (the content of the step) and maxloop (the maximum number of iterations of the step, if any). FlowStep is specialized into two types (as shown in Fig. 3-b): ActorStep and SystemStep, as mentioned in Sect. 2. We define three utility functions as shown in Table 2.

Example 4.1.1. The USL model shown in Fig. 4 consists of the FlowSteps s1, ..., s16. Among these, s1 is an ActorStep and s2 is a SystemStep. Step s3 contains the ActorInput a5. Step s1 contains the ActorRequest a1. Step s2 contains the SystemOperation a3. Step s10 contains the SystemOutput a12. Step s4 contains the SystemState a6. Step s11 contains the SystemRequest a13. Step s14 contains the SystemExtend a16. The Action a5 has the Parameter "bcid".

ControlNode (domain source (3)) represents a control action that regulates the flows across other USLNodes. A ControlNode, as illustrated in Fig. 3-c, is specialized into InitialNode, FinalNode, DecisionNode, ForkNode and JoinNode. These respectively represent the starting and ending points of the use case, the branching points of steps, and the starting and ending points of concurrent actions in steps. To ease notation, we define two overloading functions w.r.t. ControlNode and a function w.r.t. DecisionNode, as shown in Table 2.

Example 4.1.2. The USL model shown in Fig. 4 contains nine ControlNodes c0, ..., c8. In particular, c0 is an InitialNode, c7 and c8 are different FinalNodes, c1, ..., c3 and c6 are DecisionNodes, c4 is a ForkNode, and c5 is a JoinNode.

USLNode represents all the nodes (FlowStep or ControlNode) that make up a USL model.

FlowEdge (domain source (3)) is a binary directed edge between two USLNodes. If both steps are part of a basic flow, we call the transition a BasicFlowEdge. On the other hand, if both steps are part of an alternate flow, we call the transition an AlternateFlowEdge. As shown in Table 2, we define two utility functions source and target, two overloading functions guardE and a function isCompleted w.r.t. the concept FlowEdge.

Example 4.1.3. The USL model shown in Fig. 4 contains b1, ..., b18 as BasicFlowEdges and al_1, ..., al_10 as AlternateFlowEdges.

Figure 3: The USL metamodel.

Table 2: List of utility functions w.r.t. USL concepts
– firstAct: FlowStep → Action. Returns the first Action of a FlowStep.
– lastAct: FlowStep → Action. Returns the last Action of a FlowStep.
– actions: FlowStep → Actions. Returns the set of Actions of a FlowStep.
– firstAct: ControlNode → ControlNode. Returns the ControlNode itself.
– lastAct: ControlNode → ControlNode. Returns the ControlNode itself.
– source: FlowEdge → USLNode. Returns the source USLNode of a FlowEdge.
– target: FlowEdge → USLNode. Returns the target USLNode of a FlowEdge.
– guardE: FlowEdge → Constraint. Returns the guard condition.
– guardE: USLNode × USLNode → Constraint. Takes the source and target USLNodes as input and returns the guard condition.
– isCompleted: FlowEdge → Boolean. Determines whether or not lastAct(source(e)) has completed its execution.
– preA: Action → Constraint. Returns the precondition of an Action.
– preA: ControlNode → Constraint. If the ControlNode is not an InitialNode, returns true; otherwise returns the Constraint of the InitialNode.
– postA: Action → Constraint. Returns the postcondition of an Action.
– postA: ControlNode → Constraint. If the ControlNode is not a FinalNode, returns true; otherwise returns the Constraint of the FinalNode.
– preC: USLModel → Constraint. Returns the precondition of a USLModel.
– postC: USLModel → Constraint. Returns the postcondition of a USLModel.
– postC: USLModel × FinalNode → Constraint. Returns the postcondition of a particular FinalNode of a USLModel.

Variable (domain source (3)) represents variables that hold data values during the execution of a use case scenario. It inherits the concept Variable of UML presented in Sect. 15.7.25 of [2].

DescriptionInfor (domain source (2)) maintains the other textual descriptions of a use case.

Constraint (domain sources (1, 3)) represents constraints that are formed by use case variables: (1) the precondition of the use case, associated with the InitialNode; (2) the postconditions of the use case, associated with the FinalNodes; (3) the guard conditions of transitions; and (4) the pre- and postconditions of Actions. This concept inherits the concept Constraint in UML, shown in Sect. 7.6 of [2]. As depicted in Table 2, we define utility functions w.r.t. Constraints to get the pre- and postconditions of actions and of the use case.

Example 4.1.4. The USL model shown in Fig. 4 contains g1, ..., g6 as guard conditions and p1, ..., p6 as postconditions of Actions.

Figure 4: Representing the Lend Book use case as a USL model.

We define a set of OCL well-formedness rules as restrictions on the USL metamodel. These rules are defined in the context of the UseCase concept and are listed as follows.

Rule 1. A USL model has one InitialNode:
  self.uslnode->selectByType(InitialNode)->size()=1

Rule 2. A USL model has at least one FinalNode:
  self.uslnode->selectByType(FinalNode)->size()>=1

Rule 3. A USL model has at least one FlowStep:
  self.uslnode->selectByKind(FlowStep)->size()>=1

Rule 4. An InitialNode has one out-going BasicFlowEdge and does not have any in-coming FlowEdges:
  (self.flowedge->select(t:FlowEdge|t.source.oclIsTypeOf(InitialNode))->size()=1) and
  (self.flowedge->select(b:FlowEdge|(b.source.oclIsTypeOf(InitialNode)) and (b.oclIsTypeOf(BasicFlowEdge)))->size()=1) and
  (self.flowedge->select(t:FlowEdge|t.target.oclIsTypeOf(InitialNode))->size()=0)

Rule 5. A FinalNode has one in-coming FlowEdge and does not have any out-going FlowEdge:
  self.uslnode->selectByType(FinalNode)->forAll(f:FinalNode|
    (self.flowedge->select(e:FlowEdge|e.target=f)->size()=1) and
    (self.flowedge->select(e:FlowEdge|e.source=f)->size()=0))
Rule 6. A DecisionNode has one in-coming FlowEdge and at least two out-going FlowEdges:
  self.uslnode->selectByType(DecisionNode)->forAll(d:DecisionNode|
    (self.flowedge->select(e:FlowEdge|e.target=d)->size()=1) and
    (self.flowedge->select(e:FlowEdge|e.source=d)->size()>=2))

Rule 7. A ForkNode has at least one in-coming FlowEdge and at least two out-going FlowEdges:
  self.uslnode->selectByType(ForkNode)->forAll(f:ForkNode|
    (self.flowedge->select(e:FlowEdge|e.target=f)->size()>=1) and
    (self.flowedge->select(e:FlowEdge|e.source=f)->size()>=2))

Rule 8. A JoinNode has at least two in-coming FlowEdges and one out-going FlowEdge:
  self.uslnode->selectByType(JoinNode)->forAll(j:JoinNode|
    (self.flowedge->select(e:FlowEdge|e.target=j)->size()>=2) and
    (self.flowedge->select(e:FlowEdge|e.source=j)->size()=1))

Rule 9. A SystemStep or ActorStep has at least one in-coming FlowEdge and one out-going FlowEdge:
  self.uslnode->selectByKind(FlowStep)->forAll(f:FlowStep|
    (self.flowedge->select(e:FlowEdge|e.target=f)->size()>=1) and
    (self.flowedge->select(e:FlowEdge|e.source=f)->size()=1))

Rule 10. A USL model is valid if the FlowEdges that connect the USLNodes of the model are valid, i.e., their type and label are correctly defined:
  self.uslnode->forAll(n:USLNode|
    if (n.oclIsTypeOf(InitialNode)) then
      self.flowedge->select(b:FlowEdge|
        (b.source.oclIsTypeOf(USL::InitialNode)) and (b.oclIsTypeOf(USL::BasicFlowEdge)))->size()=1
    else
      if (self.flowedge->selectByType(BasicFlowEdge)->select(b:BasicFlowEdge|b.target=n))->size()>=1 then
        if (n.oclIsTypeOf(DecisionNode)) then
          self.flowedge->selectByType(BasicFlowEdge)->select(b:BasicFlowEdge|b.source=n)->size()=1
        else
          if (n.oclIsTypeOf(ForkNode)) then
            self.flowedge->select(f:FlowEdge|f.source=n)->forAll(b:FlowEdge|b.oclIsTypeOf(BasicFlowEdge))
          else
            if (n.oclIsTypeOf(JoinNode)) then
              (self.flowedge->select(f:FlowEdge|f.source=n)->forAll(b:FlowEdge|b.oclIsTypeOf(BasicFlowEdge))) and
              (self.flowedge->select(f:FlowEdge|f.target=n)->forAll(b:FlowEdge|b.oclIsTypeOf(BasicFlowEdge)))
            else
              if (n.oclIsKindOf(FlowStep)) then
                self.flowedge->select(f:FlowEdge|(f.source=n) and (f.oclIsTypeOf(BasicFlowEdge)))->size()=1
              else true
              endif
            endif
          endif
        endif
      else
        ((self.flowedge->selectByType(BasicFlowEdge)->select(b:BasicFlowEdge|b.source=n))->size()=0) and
        if (n.oclIsTypeOf(FinalNode)) then true
        else self.flowedge->selectByType(AlternateFlowEdge)->exists(f:AlternateFlowEdge|
          f.label=self.flowedge->selectByType(AlternateFlowEdge)->select(a:AlternateFlowEdge|a.target=n)->first().label)
        endif
      endif
    endif)

Rule 11. The number property of each FlowStep in a basic flow is unique:
  self.uslnode->selectByKind(FlowStep)->select(n:FlowStep|
    self.flowedge->selectByType(BasicFlowEdge)->exists(t:BasicFlowEdge|(t.source=n) or (t.target=n)))
    ->forAll(n1:FlowStep, n2:FlowStep|n1.number=n2.number implies n1=n2)

Rule 12. The number property of each FlowStep in an alternate flow is unique:
  self.uslnode->selectByKind(FlowStep)->select(n:FlowStep|
    self.flowedge->selectByType(AlternateFlowEdge)->exists(t:AlternateFlowEdge|(t.source=n) or (t.target=n)))
    ->forAll(n1:FlowStep, n2:FlowStep|
      (n1.number=n2.number) and
      (self.flowedge->selectByType(AlternateFlowEdge)->select(t1:AlternateFlowEdge|t1.target=n1)->first().label =
       self.flowedge->selectByType(AlternateFlowEdge)->select(t2:AlternateFlowEdge|t2.target=n2)->first().label)
      implies n1=n2)

Example 4.1.5. Let us focus on the USL model shown in Fig.
4:

– If we remove c0 from this model or add a new InitialNode to it, then it will violate Rule 1.
– If we remove both c7 and c8 from the model, then it will violate Rule 2.
– If the model only has ControlNodes, then it will violate Rule 3.
– If FlowEdge b1 is not a BasicFlowEdge but an AlternateFlowEdge, then the model will violate Rule 4.
– If we connect b17 to c8 and remove c7, then the model will violate Rule 5.
– If we add an AlternateFlowEdge to connect s4 to c6, then the model will violate Rules 6 and 9.
– If we remove b14, s11, b16, and p3, then the model will violate Rules 7 and 8.
– If either FlowEdge b6 is not a BasicFlowEdge but an AlternateFlowEdge, or the value of the label property of AlternateFlowEdge al_1 is not "4a", then the model will violate Rule 10.
– If the value of the number property of s2 is not 2 but 1, then the model will violate Rule 11.
– If the value of the number property of s14 is not 2 but 1, then the model will violate Rule 12.

4.2 The USL concrete syntax

In order to help the user easily create USL models, we propose a concrete syntax for USL with the graphical notations shown in Table 3. We have implemented this syntax in a visual editor for USL modelling. A detailed explanation of this tool will be presented in Sect. 6.

Table 3: The graphical notations of USL
– DescriptionInfor: a borderless text box in which the properties are listed.
– InitialNode: an unfilled circle.
– FinalNode: a circle with a crosshairs symbol.
– DecisionNode: a filled diamond with one in-coming arrowed line and at least two out-going arrowed lines.
– ForkNode: a solid line segment with one in-coming arrowed line and at least two out-going arrowed lines.
– JoinNode: a solid line segment with at least two in-coming arrowed lines and one out-going arrowed line.
– BasicEdge: a thick arrowed line.
– AlternateEdge: a labelled, thin arrowed line (the label is the name of the flow).
– ActorStep: a labelled, 2-part rectangle; the first part contains the label and the two properties numberStep and description of the ActorStep, the second part contains the ActorActions of the ActorStep.
– SystemStep: a labelled, 2-part rectangle; the first part contains the label and the two properties numberStep and description of the SystemStep, the second part contains the SystemActions of the SystemStep.
– Action: the information of an Action is presented in textual form in the second part of a FlowStep.

4.3 Formal semantics of USL

We formally define a USL model as follows. Here, we consider a USL model as a graph consisting of nodes and edges. A node represents either a step or a control action performed by the system. Further, we take into account the fact that the underlying use case references the domain concepts captured in a UML class diagram.

Definition 1. A USL model of a use case is the tuple D = ⟨DC, A, E, C⟩ such that:
– DC is a class diagram representing the underlying domain;
– A is the set of USLNodes;
– E is the set of FlowEdges;
– C = G ∪ CpreUC ∪ CpostUC ∪ CpreA ∪ CpostA is the set of Constraints, where:
– A = AcNode ∪ Af;
– AcNode = NI ∪ NF ∪ Nd ∪ Nj ∪ Nf, where NI = {a | a ∈ A, InitialNode(a)}, NF = {a | a ∈ A, FinalNode(a)}, Nd = {a | a ∈ A, DecisionNode(a)}, Nj = {a | a ∈ A, JoinNode(a)}, Nf = {a | a ∈ A, ForkNode(a)};
– |NI| = 1; |NF| ≥ 1;
– Af = Aa ∪ As, where Af = {a | a ∈ A, FlowStep(a)}, Aa = {a | a ∈ A, ActorStep(a)}, As = {a | a ∈ A, SystemStep(a)};
– |As| ≥ 1; ∀s ∈ Af: |actions(s)| ≥ 1;
– E = Eb ∪ Ea and Eb ∩ Ea = ∅, where Eb = {e | e ∈ E, BasicFlowEdge(e)}, Ea = {e | e ∈ E, AlternateFlowEdge(e)}.

Example 4.1.5.
The USL model shown in Fig. 4 contains the following elements: NI = {c0}; NF = {c7, c8}; AcNode = {c0, ..., c8}; Aa = {s1, s3, s5, s7, s13}; As = {s2, s4, s6, s8, ..., s12, s14, s15, s16}; Eb = {b1, ..., b17}; Ea = {al_1, ..., al_10}; G = {g1, ..., g6}; CpreUC = ∅; CpostUC = ∅; CpreA = ∅; and CpostA = {p1, ..., p6}. DC corresponds to the conceptual model shown in part (b) of Fig. 1. There are sixteen constraints for guard conditions and pre- and postconditions, e.g., the postcondition p1 of Action a11 is expressed by the following OCL constraint:
  BookLoan.allInstances()->exists(b:BookLoan|
    (b.bcid = bcid) and (b.bid = bid) and (b.payed = 0)).

We use an LTS [20] to formally define the operational semantics of USL. Conceptually, the execution of a USL model is modelled by an LTS, whose transitions are caused by the execution of use case actions, and whose states are defined by the variable assignments during the execution. We define the LTS of a USL model recursively from the basic USL concepts. The semantics of these concepts are defined as summarized in Table 4. Definition 2 formalizes the notion of the LTS of a USL model.

Table 4: LTS-based semantics of the basic USL concepts.

Definition 2. Given a USL model D = ⟨DC, A, E, C⟩, the LTS that results from the execution of D is the tuple ⟨Σ(V), P(G×A×P), T, σinit, F⟩ such that:
– V is a finite set of variables whose types include the basic types and the classes of DC;
– Σ(V) is the set of states σ, each of which is a set of value assignments to a subset of the variables in V;
– P ⊆ CpostA ∪ CpostUC is the set of constraints serving as the postconditions of D;
– A = AcNode ∪ Aact is the set of actions;
– G ⊆ G ∪ CpreUC ∪ CpreA is the set of guard conditions of the transitions;
– T ⊆ Σ(V) × P(G×A×P) × Σ(V) is the transition relation, defined as follows: a transition t = (σ, (g, a, r), σ') ∈ T, written as σ --g|a|r--> σ', where a ∈ A is the action that causes t, g = defGuard(a) ∈ G is the guard condition to execute a, r ∈ P is the postcondition of a, and σ, σ' ∈ Σ(V) are the pre- and post-states of t (resp.) such that σ' satisfies r;
– σinit ∈ Σ(V) is the initial state;
– F ⊆ Σ(V) is the set of final states;
where:
– Aact is the union of actions(s) over all s ∈ Af;
– defGuard is defined as follows (summarized from Table 4):
  defGuard(a) =
    preC(D), if InitialNode(a);
    guardE(e) (e ∈ D.E, target(e) = a), if DecisionNode(a), ForkNode(a) or FinalNode(a);
    the conjunction of isCompleted(e) ∧ guardE(e) over all e ∈ D.E with target(e) = a, if JoinNode(a);
    preC(DI) ∧ preA(a) ∧ guardE(e) (e ∈ D.E, target(e) = a), if SystemInclude(a), where DI is the USL model of the included use case;
    preC(DX) ∧ preA(a) ∧ guardE(e) (e ∈ D.E, target(e) = a), if SystemExtend(a), where DX is the USL model of the extending use case;
    preA(a) ∧ guardE(e) (s ∈ Af, target(e) = s), if (a ∈ Aact) and (a = firstAct(s));
    preA(a) (s ∈ Af, a ∈ actions(s)), otherwise.

Example 4.3.1. We assume that the snapshot shown in Fig. 5 is captured when the USL model shown in Fig. 4 is executed at Step a8. We have the following value assignments: (bcid, "001"), i.e., bcid = "001"; (lid, "110"), i.e., lid = "110"; (ldate, "25/8/17"), i.e., ldate = "25/8/17"; and (bid, "1234"), i.e., bid = "1234". The objects of the snapshot are as follows: BookCopy:"001", BookCopy:"002", Borrower:"123", Borrower:"124", Librarian:"100", Librarian:"111", BookLoan:"1". Then, we have σa8 = {(bcid, "001"), (ldate, "001"), (lid, "110"), (bid, "124"), (bLoan, ("2", "001", "124", "110", "25/8/17", 0)), BookCopy:"001", BookCopy:"002", Borrower:"123", Borrower:"124", Librarian:"100", Librarian:"111", BookLoan:"1"}.

Certain use case actions are concurrent actions, whose executions cause concurrent transitions between states. The next two definitions define precisely what this means.

Definition 3. Given a current state σ of an LTS L of a USL model D and a transition t = σ --g|a|r--> σ' ∈ L.T, we define the following terms:
– preT(t) = σ, postT(t) = σ', guard(t) = g, postC(t) = r, and act(t) = a.
– eval(g) is the evaluation of Constraint g.
– reachable(σ) = {t | preT(t) = σ} is the set of transitions that start from σ.
– firable(σ) = {t ∈ reachable(σ), eval(guard(t)) = true} is the set of transitions that can be fired from σ.

Example 4.3.2. When the USL model shown in Fig. 4 executes at action a11, we have σa11 = {(bcid, "001"), (ldate, "001"), (lid, "110"), (bid, "124"), (bLoan, ("2", "001", "124", "110", "25/8/17", 0)), BookCopy:"001", BookCopy:"002", Borrower:"123", Borrower:"124", Librarian:"100", Librarian:"111", BookLoan:"1", BookLoan:"2"}. The transition ta11,c4 = σa11 --true|c4|true--> σc4. Then reachable(σa11) = {ta11,c4} and firable(σa11) = {ta11,c4}.

Definition 4. Given a current state σ of an LTS L of a USL model D, a concurrent transition τ ⊆ L.T is a set of transitions t1, t2, ..., tn ∈ firable(σ).

Example 4.3.3. When the USL model shown in Fig. 4 executes at Step c4, we have two transitions tc4,a12 = σc4 --true|a12|p2--> σa12 and tc4,a13 = σc4 --true|a13|p3--> σa13, reachable(σc4) = {tc4,a12, tc4,a13} and firable(σc4) = {tc4,a12, tc4,a13}. Hence, {tc4,a12, tc4,a13} is a concurrent transition and σa12, σa13 satisfy p2, p3, respectively.

Within our approach, the LTS of a USL model may contain both concurrent and non-concurrent transitions. We next define the semantics of a use case scenario.

Definition 5. Given a use case scenario of a USL model D that consists of the sequence of actions (a0, ..., an-1), the execution of this scenario is realized as a path in the LTS L of D: p = σ0 --t0--> σ1 --t1--> ... --tn-1--> σn, where ti = σi --gi|ai|ri--> σi+1 (∀i = 0, ..., n-1), σ0 = L.σinit, σn ∈ L.F, and ti ∈ L.T.

Example 4.3.4. When the USL model shown in Fig. 4 executes at Step σa11 as mentioned above and eval(g1), eval(g3), and eval(g5) are true, then the use case scenario is as follows:
p = σinit --true|a1|true--> σa1 --true|a2|true--> σa2 --true|a3|true--> σa3 --true|a4|true--> σa4 --true|a5|true--> σa5 --true|a6|true--> σa6 --true|c1|true--> σc1 --g1|a7|true--> σa7 --true|a8|true--> σa8 --true|c2|true--> σc2 --g3|a9|true--> σa9 --true|a10|true--> σa10 --g5|a11|true--> σa11 --true|c4|true--> σc4 --{true|a12|p2, true|a13|p3}--> σa12-a13 --true|c5|true--> σc5 --true|c7|true--> σc7 (σc7 ∈ F).

5 Transforming USL models to other software artifacts

This section explains how USL models can be transformed into software artifacts, including test cases, structural and behavioural models, and textual template-based use case descriptions (TUCDs). We particularly focus on the last transformation (to obtain TUCDs) and show how the transformation can be realized.

5.1 Generating test cases

A test scenario is used to create a set of test cases [27]. A test case results from combining a test scenario with some test data. According to the use case-driven testing approach [27], a use case scenario identifies one test scenario (a use case description consists of one or more use case scenarios). The constraints of a use case scenario help identify the test data of the corresponding test scenario.

The model-based testing (MBT) method [28] presents a specific technique for automatically generating test cases from a use case model. Specifically, the control flows of a use case model are used to generate the use case scenarios.
For example, Linzhang et al. [29] first present a technique to represent the control flows using a UML activity diagram. They then propose an algorithm that traverses all the possible basic paths of the activity diagram to generate the test scenarios. Two other works [30, 31] focus on the problem of automatically generating test data from the test scenario constraints, written in OCL; they develop OCL constraint solvers for this task.

Since our USL captures the necessary information elements of the use case description, we argue that USL models can also be used as an input to generate test cases. More specifically, USL has meta-concepts for representing the different control nodes of the UML activity diagram. Further, the Constraint meta-concept of USL captures the different types of constraints that are needed to generate test data.

5.2 Generating structural and behavioural models

In the requirement analysis activity, the behaviours described in a use case description are analysed in order to create other structural and behavioural models. The target models are often represented using UML diagrams, including activity diagrams, class diagrams, collaboration diagrams, and sequence diagrams.

D. Savić et al. [16] and M. Smialek et al. [17] propose specific methods for the above. In particular, they first use different types of actions to precisely model the use case behaviours. They then present a model transformation technique that automatically transforms the behaviours and other relevant model elements into a class diagram. Examples of these elements discussed in [28] include sender and receiver objects, messages, and parameters.

Our USL specification was inspired by this work. Specifically, we use the Action meta-concept to represent use case behaviours and the relevant model elements discussed above. Regarding behavioural modelling, a USL model can be used as input to generate activity and sequence diagrams, because USL represents all the control nodes of the UML activity diagram. For example, a specific technique for generating sequence diagrams is presented in [12].

5.3 Generating TUCDs

According to [32, 33, 34], textual template-based use case descriptions (TUCDs) [1, 8, 22] enable the customer to actively participate in requirement analysis, to identify and resolve conflicts in the requirement drafts, and to ensure that the requirements are consistent with their intention. Table 1 shown earlier is an example of such a template.

In order to automatically generate a TUCD from a USL model, we develop a transformation USL2TUCD using the model-to-text transformation language Acceleo [18]. The transformation USL2TUCD is shown in Listing 1. We illustrate this transformation using the USL model of the use case named Withdrawal (shown in Fig. 9). The output TUCD is a text file named Withdrawal.txt, shown in Fig. 10.

Briefly, the USL2TUCD transformation uses five queries to extract information from the input USL model (uc). The first query is getBasicFlow(uc) at line 19. It is used to find all the BasicFlowSteps in uc. The second query is getDecisionNode(uc) at line 25. It is used to get all the DecisionNodes in uc. The third query is getPreAlternatFlowLabel(uc, d) at line 27. It is used to get the label of the in-coming AlternateFlowEdge of some DecisionNode d in uc. It returns empty if no such AlternateFlowEdge exists. The fourth query is getAFEdges(uc, d) at lines 29 and 37. This query is used to get the out-going AlternateFlowEdges from a DecisionNode d in uc.
The fifth query is getAlternateFlow(uc, l) at lines 30 and 39. This query is used to find the FlowSteps of the alternate flow in uc that is labelled l. The definitions of all five queries are written in another transformation named libraryUCD, which is shown in Listing 2.

Listing 1: The USL2TUCD transformation
1  [module GenUCDescription('http://eclipse.USLModel/USL')]
2  [import org::eclipse::acceleo::module::sample::service::libraryUCD]
3  [template public generateElement(uc:UseCase)]
4  [comment @main/]
5  [file (uc.descriptioninfor->at(0).useCaseName.concat('.txt'), false, 'UTF-8')]
6  [let d:DescriptionInfor = uc.descriptioninfor->at(0)]
7  ----------------------------------
8  UC name: [d.useCaseName/]
9  Description: [d.description/]
10 Actor: [for(a:String|d.actor)] a, [/for]
11 Level abstract: [d.levelAbstract/]
12 Precondition: [d.preCondition/]
13 Postcondition: [d.postCondition/]
14 SpecialRequirement: [d.specialRequirement/]
15 [/let]
16 ----------------------------------
17 BasicFlow
18 ----------------------------------
19 [let Bsteps:OrderedSet(FlowStep) = getBasicFlow(uc)]
20 [for(s:FlowStep|Bsteps)]
21 [s.number/]. [s.description/]
22 [/for] [/let]
23 ----------------------------------
24 AlternateFlow
25 [let dList:OrderedSet(DecisionNode) = getDecisionNode(uc)]
26 [for(d:DecisionNode|dList)]
27 [let preALabel:String = getPreAlternatFlowLabel(uc, d)]
28 [if(preALabel='')]
29 [for(af:AlternateFlowEdge|getAFEdges(uc, d))]
30 [let Asteps:OrderedSet(FlowStep) = getAlternateFlow(uc, af.label)]
31 [af.label/]. [af.description/]
32 [for(s:FlowStep|Asteps)]
33 [s.number/]. [s.description/]
34 [/for] [/let]
35 [/for]
36 [else]
37 [for(af:AlternateFlowEdge|getAFEdges(uc, d))]
38 [if(af.label <> preALabel)]
39 [let Asteps:OrderedSet(FlowStep) = getAlternateFlow(uc, af.label)]
40 [af.label/]. [af.description/]
41 [for(s:FlowStep|Asteps)]
42 [s.number/]. [s.description/]
43 [/for] [/let]
44 [/if]
45 [/for]
46 [/if] [/let]
47 [/for] [/let]
48 ----------------------------------
49 [/file]
50 [/template]

Listing 2: The libraryUCD transformation
1  [comment encoding = UTF-8 /]
2  [module libraryUCD('http://eclipse.USLModel/USL')]
3  [query public getBasicFlow(uc:UseCase): OrderedSet(FlowStep) = uc.uslnode->select(n:USLNode|uc.flowedge->selectByType(BasicFlowEdge)->exists(b:BasicFlowEdge|(n=b.source) or (n=b.target)))->selectByKind(FlowStep)/]
4
5  [query public getAlternateFlow(uc:UseCase, l:String): OrderedSet(FlowStep) = uc.uslnode->select(n:USLNode|uc.flowedge->selectByType(AlternateFlowEdge)->select(a:AlternateFlowEdge|a.label=l)->exists(f:AlternateFlowEdge|(f.target=n) or (f.source=n)))->selectByKind(FlowStep)/]
6
7  [query public getAFEdges(uc:UseCase, d:DecisionNode): OrderedSet(AlternateFlowEdge) = uc.flowedge->select(f:FlowEdge|(f.source=d) and (f.oclIsTypeOf(AlternateFlowEdge)))->selectByType(AlternateFlowEdge)/]
8
9  [query public getDecisionNode(uc:UseCase): OrderedSet(DecisionNode) = uc.uslnode->selectByType(DecisionNode)/]
10
11 [query public getPreAlternatFlowLabel(uc:UseCase, d:DecisionNode): String =
12   if uc.flowedge->selectByType(AlternateFlowEdge)->select(f:AlternateFlowEdge|f.target=d)->size()>0 then
13     uc.flowedge->selectByType(AlternateFlowEdge)->select(f:AlternateFlowEdge|f.target=d)->at(0).label
14   else
15     ''
16   endif /]

6 Tool support and evaluation

In this section, we first describe a USL tool that we have developed for visually creating USL models. After that, we explain two case studies for USL. We conclude this section with an evaluation of USL.
6.1 Tool support

We developed a support tool for our approach, as illustrated in Fig. 6. This USL tool provides three main functions. The first function (displayed on the left of the figure) is called the "Loading function". It is responsible for loading the use cases and domain concepts of a system from a UML use case diagram and a class diagram. The second function (shown on the right of the figure) is called the "USL Editor". It is used to create the USL models for the loaded use cases. This editor has a user-friendly GUI. The third function is called "Generating Artifacts". It automatically generates other software artifacts.

In our tool, the "Loading function" was developed as a Java project. The "USL Editor" was implemented using an EMF project and a GMF project within the Eclipse tool [26]. Specifically, the EMF project is used to build the abstract syntax of USL, and the GMF project is used to build the concrete syntax and to implement the OCL constraint rules on the metamodel. The "Generating Artifacts" function was written using model transformation languages, such as M2T and M2M [18]. To illustrate, Fig. 8 shows a USL model for the use case Session, created with the "USL Editor". Figure 10 shows a TUCD text file that is automatically generated by the transformation specified earlier in Listings 1 and 2. This transformation was written in the Acceleo M2T language. Note that, when working with a generalization relationship between use cases, the modeler needs to create USL models only for the specific use cases rather than for the abstract ones.

Figure 6: The USL tool.

6.2 Case study

In order to demonstrate the applicability of our method, we chose another system case study named ATM, which is described by Bjork [35]. The system includes three actors, seven specific use cases, one abstract use case, and two use case relationships. Figure 7 shows the use cases of the ATM system. Figure 8 and Fig. 9 show two USL models corresponding to the two use cases Session and Withdrawal. Figure 10 and Fig. 11 show the two TUCD text files that are generated from these two USL models by applying the "Generating Artifacts" function. These files are the use case descriptions of the two corresponding use cases.

Figure 7: The use case diagram of the ATM system.

6.3 Language evaluation

This section presents our evaluation of USL's expressiveness, compared to five languages: RUCM [3], UC-B [10], MBD-L [4] (where 'MBD' stands for the authors' names and 'L' for language), SilabReq [16] and RSL [17]. We use the following four sub-criteria of expressiveness:

C1. Template-based representation of use case descriptions
C2. Control flow-based representation of use case behaviour
C3. Action specification
C4. Use case constraint representation

Table 5 lists the evaluation results for the above criteria. In the table, we use the three letters 'F', 'I', 'N' to denote the specification method used by each language: 'F' denotes a formal specification method, 'I' denotes an informal specification method, and 'N' denotes that the specification method is not discussed.
Table 5: Expressiveness comparison between use case specification languages

Use case information | RUCM [3] | UC-B [10] | MBD-L [4] | SilabReq [16] | RSL [17] | USL
(c1) Overview elements | I | N | F | N | N | F
(c1) Flows of use case | I | I | F | N | N | F
(c1) Use case scenarios | N | N | N | F | F | N
(c2) Control flows | I | N | F | N | F | F
(c2) Concurrent actions | N | N | N | N | N | F
(c3) Action types | I | N | F | F | F | F
(c4) Use case scenario's pre- and postcondition | I | F | F | N | I | F
(c4) Guard conditions | I | F | F | N | I | F
(c4) Action's pre- and postcondition | N | F | N | N | N | F

We will discuss the results shown in the table in the first five subsections that follow. In the last subsection, we discuss the possibility of applying USL in practice.

6.3.1 Template-based representation of use case descriptions

As discussed in Sect. 4, USL enables us to capture all the information elements of the use case description template shown in Table 1. In particular, the elements of the overview information are described by the properties of the DescriptionInfor object in the model. The steps in a basic flow are represented by FlowSteps (including ActorSteps and SystemSteps) and are connected by BasicFlowEdges and ControlNodes. Similarly, the steps in an alternate flow are represented by FlowSteps (including ActorSteps and SystemSteps) and are connected by ControlNodes and AlternateFlowEdges. USL represents this template precisely using the corresponding USL meta-concepts. Briefly, we draw the following conclusions from Table 5:

– USL is more expressive and more precise than three other languages, namely UC-B, SilabReq, and RSL.
– USL is more precise than RUCM.
– USL is as expressive and precise as MBD-L.

Specifically, USL is more expressive than UC-B, SilabReq, and RSL for the following reasons. First, the use case information elements captured in USL are more formal than those represented in UC-B; UC-B provides a GUI for informally describing use case scenarios. Second, UC-B only represents the steps of a use case scenario and the trigger of a use case, while SilabReq and RSL only capture the flows corresponding to use case scenarios. With USL, we can express more use case information, such as the pre- and postconditions of an action.

Figure 8: Modelling use case Session in the USL Editor tool.

On the other hand, USL captures information elements as expressively as RUCM. The RUCM method proposes a Restricted Use Case Modeling (RUCM) language, using a set of keywords and restricted description rules. Specifications in USL are more formal than those in RUCM, because RUCM's specifications are expressed in natural language.

In comparison with MBD-L, USL lacks concepts for specifying sub-flows. However, as discussed in Sect. 2, use cases containing sub-flows can be smoothed so that they are suitable for modelling in USL. On the other hand, MBD-L is only specified with an abstract syntax; unlike USL, it does not provide a concrete syntax or a formal semantics.

6.3.2 Control flow representation for use case behaviour

Our USL language is built on the UML activity diagram. A USL model includes USLNodes (corresponding to Nodes in the UML activity diagram) and FlowEdges (corresponding to Edges in the UML activity diagram) to specify the control flows that pass through the steps in the use case's flows. USL captures the different control flow types of the UML activity diagram (such as sequence, branch, loop, and concurrent flows). In addition, USL can also specify steps with a limited number of iterations. For example, in the use case Session in Subsect. 6.2, Step 4 executes a maximum of three times.
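As an illustration of how use case scenarios correspond to paths through such a control-flow graph (cf. Sect. 5.1), the short Python sketch below enumerates the scenarios of a toy node/edge graph in the spirit of a USL model. It is only an illustration; the node names, edge list, and the simple-path traversal are made up here and are not the generation algorithm of the USL tool.

from typing import Dict, List, Tuple

# Toy control-flow graph: edges are (source, target, kind), kind "basic" or "alternate".
EDGES: List[Tuple[str, str, str]] = [
    ("initial", "s1", "basic"),
    ("s1", "d1", "basic"),        # d1: a DecisionNode (e.g., valid / invalid input)
    ("d1", "s2", "basic"),
    ("d1", "s2a", "alternate"),   # alternate flow, e.g. labelled "2a"
    ("s2", "final", "basic"),
    ("s2a", "final", "basic"),
]

def scenarios(start: str, end: str,
              edges: List[Tuple[str, str, str]]) -> List[List[str]]:
    """Enumerate all simple paths from the InitialNode to a FinalNode.
    Each path corresponds to one use case scenario (one control flow)."""
    succ: Dict[str, List[str]] = {}
    for src, tgt, _ in edges:
        succ.setdefault(src, []).append(tgt)
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        if node == end:
            paths.append(path)
            return
        for nxt in succ.get(node, []):
            if nxt not in path:   # keep paths simple, i.e. no revisits
                walk(nxt, path + [nxt])

    walk(start, [start])
    return paths

# Prints two scenarios: the basic flow and the one passing through the alternate step.
for p in scenarios("initial", "final", EDGES):
    print(" -> ".join(p))

A bounded loop (the maxloop attribute of a FlowStep) would limit how often a step may be revisited during such a traversal; the sketch simply forbids revisits for brevity.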
Briefly, we draw the following conclusions from Table 5:

– USL can represent concurrent steps, while the other languages cannot.
– USL directly represents control flows using USLNodes and FlowEdges, while the other languages do not model the flows directly.

Figure 9: Modelling use case Withdrawal in the USL Editor tool.

Moreover, these works do not directly specify control flows; they only capture rejoin points or refer to other steps.

6.3.3 Action specification

As discussed in Sect. 4, USL precisely specifies use case behaviours using nine action types. These action types are represented by meta-concepts in the USL metamodel. The action type of each behaviour enables us to identify sender objects, receiver objects, messages, parameters of actions, and object types. Briefly, we draw the following conclusions from Table 5:

– Action type coverage: USL represents all the action types that are supported in the other languages. In addition, USL adds several action types compared to the four related languages MBD-L, SilabReq, RSL, and RUCM; in particular, USL employs two new action types, IncludeAction and ExtendAction, to represent use case relationships.
– Precise specification: USL uses the USL meta-concepts to represent actions, and the actions in USL are precisely specified using pre- and postconditions. Some languages, e.g., MBD-L, SilabReq, and RSL, also support this feature; others, namely UC-B and RUCM, do not support it.

We use more action types to classify behaviours, and we capture the behaviours' information more precisely. More specifically, by using different concepts in USL to specify action types, our approach captures behaviours more formally than UC-B, in which behaviours are neither precisely specified nor divided into different action types. Similarly, behaviours are better captured in USL than in RUCM, because the latter only uses keywords and restricted rules in natural language to divide behaviours into action types. Moreover, RUCM does not support the action type named SystemDisplay, which is captured in USL.

In comparison with MBD-L, SilabReq, and RSL, the actions in USL are classified more finely, with nine action types. MBD-L uses only four categories of action types: Request, DataValidate, Expletive and Response. Similarly, SilabReq divides actions into four groups: Actor prepares Data (APDExecuteSO), Actor calls System (ACSExecuteSO), System executes SystemOperation (SExecuteSO), and System replies and returns Result (SRExecutionSO). The classification method of RSL is less specific than USL's, because it does not support the type of system action that sends a request to a primary actor. This system action type is specified in USL by SystemRequest.

6.3.4 Constraint representation

USL employs OCL to define the constraints in a use case. Specifically, a use case's precondition is specified by a Constraint associated with the InitialNode. A use case scenario's postcondition is specified by a Constraint associated with a FinalNode of the scenario. Similarly, the guard conditions on flows and the actions' pre- and postconditions are captured by Constraints associated with FlowEdges and actions, respectively. Briefly, we draw the following conclusions from Table 5:

– USL supports a more complete set of constraints than the four other languages RUCM, MBD-L, SilabReq, and RSL.
– Constraint representation in USL (using OCL) is more precise than in two other languages, RUCM and RSL (these languages use natural language to write constraints).

USL specifies more constraint types than the four other languages RUCM, MBD-L, SilabReq, and RSL.
Unlike USL, these languages do not support actions' pre- and postconditions. Moreover, USL is more precise than RUCM and RSL, because these two languages write their constraints in natural language, whereas USL expresses them in OCL.

It is worth mentioning that the constraints specified in USL are quite similar to the constraints in UC-B; in the latter, constraints are specified using Event-B's mathematical language. However, this language is rather inconvenient and difficult for non-technical stakeholders to understand.

6.3.5 Applying USL in practice

It is possible to apply USL in practice for two main reasons. First, as discussed in Sect. 5, use cases are precisely specified and represented in USL as models, which conform to a metamodel. This enables them to be automatically transformed into other software artifacts, such as textual use case descriptions, structural and behavioural models, and test cases. These generated models are necessary artifacts in software development.

Second, the USL tool realizes our USL approach as an Eclipse modeling project (DSL toolkit) [26]. This tool enables the modeler to visually create USL models and to integrate these models with the existing UML use case model and class model (the latter captures the domain concepts of a system). Moreover, our DSL toolkit provides the meta-metamodel language MOF used to build USL. It also enables the definition of model transformation languages in order to realize the transformations discussed in Sect. 5.

However, USL is not without limitations. The graphical concrete syntax of the language might be inconvenient for modelers who prefer writing use cases in textual form. In order to accommodate this, the USL tool could be extended with a textual editor, similar to the one used in the RSL approach [17]. This textual editor would enable a modeler to specify use cases by entering descriptive sentences about the actions in steps, the constraints, and the relations between steps. The tool would then process these to create the corresponding USL model.

7 Related work

We position our work at the intersection between use case-driven development [1] and model-driven development [18]. Within this context, a use case model is usually represented as a combination of a UML use case diagram and a textual description written in natural language. Such a use case specification tends to be ambiguous, unclear, and inconsistent. In order to specify use cases precisely, several approaches have been proposed [3, 4, 10, 16, 17].

T. Yue et al. [3] proposed a use case modeling language called Restricted Use Case Modeling (RUCM), which is composed of a use case description template, a set of keywords, and a set of well-defined restrictions on a restricted natural language to specify use cases. However, RUCM is a semi-formal textual language and it omits some important information, such as concurrent actions and the pre- and postconditions of actions. Hence, other works that use RUCM to express use case specifications for automatically generating other artifacts have to apply Natural Language Processing (NLP) techniques to extract the required information. For example, C. Wang et al. [30] use use case specifications expressed in RUCM in order to generate test cases. After using NLP techniques to extract test scenarios and constraints described in natural language, they use OCL to specify the constraints precisely and use these precise specifications to automatically generate test data.

R. Murali et al. [10] proposed using a mathematical language based on
Event-B in order to formalize the pre- and postconditions of triggers and actions within use case flows. However, the other descriptions of a use case remain informal. Their proposal focuses on automatically generating a corresponding Event-B model, which is then amenable to the Rodin verification tools that enable system-level properties to be verified.

M. Misbhauddin et al. [4] extended the metamodel of UML use case models in order to capture both the structural and behavioural aspects of use cases. To specify a use case, they developed a prototype tool called UCDest. However, concurrent actions and the pre- and postconditions of actions are not addressed. Moreover, action types are defined inadequately.

D. Savić et al. [16] and M. Smialek et al. [17] proposed the DSLs named SilabReq and RSL in order to capture use cases as functional requirements models. These DSLs only focus on the flows describing use case scenarios, while the other descriptive information of a use case is omitted. In addition, RSL does not distinguish between an action that inserts an extending use case and one that inserts an included use case; both are defined as the same kind of action. Furthermore, these DSLs do not address concurrent actions or the pre- and postconditions of actions.

In comparison with all the work above, we provide a formal semantics for USL, expressed using an LTS, while the other works lack a formal semantics.

Our previous work [36, 9] proposed a metamodel to specify use cases. In that work we also tried to define a precise semantics for use cases based on graph transformation. Our work here continues it by enhancing the use case metamodel as well as proposing a new LTS-based technique in order to characterize the operational semantics of use cases. Furthermore, all the above-mentioned approaches still lack a method for specifying use cases that covers all the relevant information of use cases, including flows, steps, system actions, actor actions, control flows, relationships, and constraints on the use case and its flows.

The USL language, introduced in this work, aims to cover all the relevant information of a use case, including both its structural and behavioural aspects. Compared to the current works in the literature, USL offers the following advantages: (1) it can specify concurrent actions in flows; (2) it captures and represents nine action types, among which the system action that includes another use case and the system action that extends another use case have not been addressed in other research; (3) it presents not only constraints on the use case and its flows but also the pre- and postconditions of each action in the flows; and (4) it presents the control flows of the steps within the use case. In addition, in this paper we also defined an operational semantics of USL to specify the dynamic information arising when use case scenarios execute. In that way, we can obtain software artifacts from USL models by transformations.

8 Conclusion

This paper proposed a DSL named USL to specify use cases. A USL model can cover the relevant information of a use case description, including flows, steps, system actions, actor actions, relationships, control flows, and constraints. We built the abstract syntax of USL using a metamodel together with well-formedness rules, as well as its graphical concrete syntax. Moreover, we defined a precise semantics for USL by mapping USL models to LTSs. We also developed a USL Editor to create USL models visually. In addition, we explained how USL models can be transformed into other software artifacts and developed a model transformation program to automatically generate textual template-based use case descriptions.
Moreover, we evaluated USL's expressiveness.

In future work, we will focus on realizing transformations from USL models in order to generate test cases as well as other software artifacts automatically. In addition, we will enrich the abstract syntax and enhance the concrete syntax of USL in order to provide better support for modelers.

Acknowledgement

We wish to thank the anonymous reviewers for their useful comments.

References

[1] I. Jacobson, Object-Oriented Software Engineering: A Use Case Driven Approach, Addison-Wesley Longman Publishing Co., Inc., 2004.
[2] OMG, "UML 2.5," May 2005.
[3] T. Yue, L.C. Briand, and Y. Labiche, "Facilitating the Transition from Use Case Models to Analysis Models: Approach and Experiments," ACM Trans. Softw. Eng. Methodol., vol. 22, no. 1, pp. 5:1-5:38, March 2013.
[4] M. Misbhauddin and M. Alshayeb, "Extending the UML Use Case Metamodel with Behavioral Information to Facilitate Model Analysis and Interchange," Software & Systems Modeling, vol. 14, no. 2, May.
[5] P. Kruchten, The Rational Unified Process: An Introduction, 3rd ed., Addison-Wesley Professional, 2004.
[6] D. Liu, K. Subramaniam, B.H. Far, and A. Eberlein, "Automating Transition from Use-Cases to Class Model," Proc. Canadian Conf. Electrical and Computer Engineering: Toward a Caring and Humane Technology (CCECE), 2003.
[7] P. Haumer, "Use case-based software development," in Scenarios, Stories, Use Cases: Through the Systems Development Life-Cycle, ed. I. Alexander and N. Maiden, ch. 12, pp. 237-264, Wiley, 2004.
[8] S. Tiwari and A. Gupta, "A Systematic Literature Review of Use Case Specifications Research," Inf. Softw. Technol., vol. 67, no. C.
[9] D.H. Dang, "Triple Graph Grammars and OCL for Validating System Behavior," Proc. 4th Int. Conf. Graph Transformations (ICGT), LNCS 5214, pp. 481-483, Springer, 2008.
[10] R. Murali, A. Ireland, and G. Grov, "UC-B: Use Case Modelling with Event-B," Proc. 5th Int. Conf. Abstract State Machines, Alloy, B, TLA, VDM, and Z (ABZ), ed. M. Butler, K.D. Schewe, A. Mashkoor, and M. Biro, LNCS 9675, Springer, 2016.
[11] W. Grieskamp and M. Lepper, "Using Use Cases in Executable Z," Proc. 3rd Int. Conf. Formal Engineering Methods (ICFEM), pp. 111-119, IEEE, 2000.
[12] J.S. Thakur and A. Gupta, "Automatic Generation of Sequence Diagram from Use Case Specification," Proc. 7th India Conf. Software Engineering (ISEC), pp. 20:1-20:6, ACM, 2014.
[13] L. Li, "Translating Use Cases to Sequence Diagrams," Proc. 15th Int. Conf. Automated Software Engineering (ASE), pp. 293-298, IEEE Computer Society, 2000.
[14] J.M. Almendros-Jiménez and L. Iribarne, "Describing Use Cases with Activity Charts," Proc. Int. Conf. Metainformatics (MIS 2004), LNCS 3511, pp. 141-159, Springer-Verlag, 2005.
[15] S. Tiwari and A. Gupta, "An Approach of Generating Test Requirements for Agile Software Development," Proc. 8th India Conf. Software Engineering (ISEC), ACM, 2015.
[16] D. Savić, S. Vlajić, S. Lazarević, I. Antović, V. Stanojević, M. Milić, and A.R. da Silva, "Use Case Specification Using the SILABREQ Domain Specific Language," Computing and Informatics, vol. 34, no. 4, pp. 877-910, Feb. 2016.
[17] M. Smialek and W. Nowakowski, From Requirements to Java in a Snap: Model-Driven Requirements Engineering in Practice, Springer.
[18] M. Brambilla, J. Cabot, and M. Wimmer, Model-Driven Software Engineering in Practice, 1st ed., Morgan & Claypool Publishers, 2012.
[19] OMG, "OCL 2.0," May 2006.
[20] R.M. Keller, "Formal verification of parallel programs," Commun. ACM, vol. 19, no. 7, pp. 371-384, July 1976.
[21] M.H. Chu, D.H. Dang, N.B. Nguyen, M.D. Le, and T.H. Nguyen, "USL: Towards Precise Specification of Use Cases for Model-Driven Development," Proc. 8th Int. Conf. Information and Communication Technology (SoICT), pp. 401-408, 2017.
[22] A. Cockburn, Writing Effective Use Cases, 1st ed., Addison-Wesley Professional, Boston, Oct. 2000.
[23] I. Jacobson, I. Spence, and K. Bittner, USE-CASE 2.0: The Guide to Succeeding with Use Cases, Ivar Jacobson International SA, 2011.
[24] M. Giese and R. Heldal, "From Informal to Formal Specifications in UML," Proc. Int. Conf. The Unified Modeling Language: Modelling Languages and Applications (UML), ed. T. Baar, A. Strohmeier, A.M.D. Moreira, and S.J. Mellor, LNCS 3273, 2004.
[25] P. Schmitt, I. Tonin, C. Wonnemann, E. Jenn, S. Leriche, and J.J. Hunt, "A Case Study of Specification and Verification Using JML in an Avionics Application."
[26] R.C. Gronback, Eclipse Modeling Project: A Domain-Specific Language (DSL) Toolkit, 1st ed., Addison-Wesley Professional, Boston, March 2009.
[27] J. Heumann, "Generating Test Cases From Use Cases," tech. rep., Rational Software, 2001.
[28] M. Utting and B. Legeard, Practical Model-Based Testing: A Tools Approach, 1st ed., Morgan Kaufmann, Amsterdam/Boston, Dec. 2006.
[29] W. Linzhang, Y. Jiesong, Y. Xiaofeng, H. Jun, L. Xuandong, and Z. Guoliang, "Generating Test Cases from UML Activity Diagram Based on Gray-Box Method," Proc. 11th Asia-Pacific Conf. Software Engineering (APSEC), IEEE Computer Society, 2004.
[30] C. Wang, F. Pastore, A. Goknil, L. Briand, and Z. Iqbal, "Automatic Generation of System Test Cases from Use Case Specifications," Proc. Int. Symposium on Software Testing and Analysis (ISSTA), pp. 385-396, ACM, 2015.
[31] "Generating Test Data from OCL Constraints with Search Techniques," vol. 39.
[32] B. Regnell, M. Andersson, and J. Bergstrand, "A Hierarchical Use Case Model with Graphical Representation," Proc. IEEE Symposium and Workshop on Engineering of Computer-Based Systems (ECBS), pp. 270-277, IEEE Computer Society, 1996.
[33] G. Kotonya and I. Sommerville, Requirements Engineering: Processes and Techniques, 1st ed., Wiley Publishing, 1998.
[34] M. Genero Bocco, A. Durán Toro, and B. Bernárdez Jiménez, "Empirical Evaluation and Review of a Metrics-Based Approach for Use Case Verification," Journal of Research and Practice in Information Technology, vol. 36, no. 4, pp. 247-258, 2004.
[35] Russell C. Bjork, "An Example of Object-Oriented Design: An ATM Simulation." http://www.math-cs.gordon.edu/courses/cs211/ATMExample/. Accessed: 2018-01-01.
[36] D.H. Dang, A.H. Truong, and M. Gogolla, "Checking the Conformance between Models Based on Scenario Synchronization," Journal of Universal Computer Science, vol. 16, no. 17, pp. 2293-2312, 2010.

Effective Deep Multi-source Multi-task Learning Frameworks for Smile Detection, Emotion Recognition and Gender Classification

Dinh Viet Sang and Le Tran Bao Cuong
Hanoi University of Science and Technology, 1 Dai Co Viet, Hai Ba Trung, Hanoi, Vietnam
E-mail: sangdv@soict.hust.edu.vn, ltbclqd2805@gmail.com

Keywords: multi-task learning, convolutional neural network, smile detection, emotion recognition, gender classification

Received: March 29, 2018

Automatic human facial recognition has been an active research topic with various potential applications.
In this paper, we propose effective multi-task deep learning frameworks which can jointly learn representations for three tasks: smile detection, emotion recognition and gender classification. In addition, our frameworks can be learned from multiple sources of data with different kinds of task-specific class labels. Extensive experiments show that our frameworks achieve superior accuracy over recent state-of-the-art methods in all three tasks on popular benchmarks. We also show that the joint learning helps the tasks with less data benefit considerably from the other tasks with richer data.

Povzetek: Razvita je izvirna metoda globokih nevronskih mrež za tri hkratne naloge: prepoznavanje smeha, čustev in spola.

1 Introduction

In recent years, we have witnessed a rapid boom of artificial intelligence (AI) in various fields such as computer vision, speech recognition and natural language processing. A wide range of AI products have boosted labor productivity, improved the quality of human life, and saved human and social resources. Many artificial intelligence applications have reached or even surpassed human level in some cases.

Automatic human facial recognition has become an active research area that plays a key role in analyzing emotions and human behaviors. In this work, we study different human facial recognition tasks, including smile detection, emotion recognition and gender recognition. All three tasks use facial images as input. In the smile detection task, we have to detect whether the people appearing in a given image are smiling or not. In the emotion recognition task, we classify their emotions into seven classes: angry, disgust, fear, happy, sad, surprise and neutral. Finally, in the gender classification task, we determine who are males and who are females.

In general, these tasks are often solved as separate problems. This may lead to many difficulties in learning models, especially when the training data is not large enough. On the other hand, the data of different facial analysis tasks often shares many common characteristics of human faces. Therefore, joint learning from multiple sources of face data can boost the performance of each individual task.

In this paper, we introduce effective deep convolutional neural networks (CNNs) to simultaneously learn common features for smile detection, emotion recognition and gender classification. Each task takes input data from its corresponding source, but all the tasks share a big part of the networks with many hidden layers. At the end of each network, the tasks are separated into three branches with different task-specific losses. We combine all the losses to form a common network objective function, which allows us to train the networks end-to-end via the back-propagation algorithm. The main contributions of this paper are as follows:

1. We propose effective architectures of CNNs that can learn joint representations from different sources of data to simultaneously perform smile detection, emotion recognition and gender classification.
2. We conduct extensive experiments and achieve new state-of-the-art accuracies in different tasks on popular benchmarks.

The rest of the paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we present our proposed multi-task deep learning frameworks and describe how to train the networks from multiple data sources. Finally, in Section 4, we show the experimental results on popular datasets and compare our proposed frameworks with recent state-of-the-art methods.
2 Related work

2.1 Deep convolutional neural networks

In recent years, deep learning has been proven to be effective in many fields, and particularly in computer vision. Deep CNNs are one of the most popular models in the family of deep neural networks. LeNet [21] and AlexNet [20] are known to be the earliest CNN architectures, with few hidden layers. Recent CNNs such as VGG [33], Inception [35], ResNet [13] and DenseNet [16] tend to be deeper and deeper. In ResNet, residual blocks can be stacked on top of each other to form networks with over 1000 layers. Meanwhile, some other CNN architectures like WideResNet [41] or ResNeXt [40] tend to be wider. All these effective CNNs have demonstrated impressive performance in one of the biggest and most prestigious competitions in computer vision: the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

2.2 Smile detection

Traditional methods often detect smiles based on a strong binary classifier with low-level face descriptors. Shan et al. [32] propose a simple method that uses the intensity differences between pixels in gray-scale facial images and then combines them with an AdaBoost classifier [39] for smile detection. In order to represent faces, Liu et al. [23] use histograms of oriented gradients (HOG) [10], while An et al. [4] use local binary patterns (LBP) [3], local phase quantization (LPQ) [25] and HOG. Both works [23, 4] then apply an SVM classifier [9] to detect smiles. Jain et al. [18] propose to use Multi-scale Gaussian Derivatives (MGD) together with an SVM classifier for smile detection.

Some recent methods focus on applying deep neural networks to smile detection. Chen et al. [6] use deep CNNs to extract high-level features from facial images and then use SVM or AdaBoost classifiers to detect smiles as a classification task. Zhang et al. [42] introduce two efficient CNN models called CNN-Basic and CNN 2-Loss. The CNN 2-Loss is an improved variant of the CNN-Basic that tries to learn features by using two supervisory signals. The first one is a recognition signal that is responsible for the classification task. The second one is an expression verification signal, which is effective for reducing the variation of the features extracted from images of the same expression class. [30] proposes an effective VGG-like network, called BKNet, to detect smiles. BKNet achieves better results than many other state-of-the-art methods in smile detection.

2.3 Emotion recognition

Classical approaches to facial expression recognition are often based on the Facial Action Coding System (FACS) [11]. FACS includes a list of Action Units (AUs) that describe various facial muscle movements causing changes in facial appearance. Cootes et al. [38] propose a model based on an approach called the Active Appearance Model [8] that creates over 500 facial landmarks. Next, the authors apply the PCA algorithm to the set of landmarks and derive Action Units (AUs). Finally, a single-layered neural network is used to classify facial expressions.

In the Kaggle facial expression recognition competition [1], the winning team [36] proposes an effective CNN, which uses the multi-class SVM loss instead of the usual cross-entropy loss. In [31], Sang et al. propose the so-called BKNet architecture for emotion recognition and achieve better performance compared to previous methods.

2.4 Gender classification

Conventional methods for gender classification often take image intensities as input features. [26] combines the 3D structure of the head with image intensities.
[15] uses im­age intensities combined with SVM classifer. [5] tries to use AdaBoost instead of SVM classifer. [12] introduces a neural network trained on a small set of facial images. [37] uses theWebers Local texture Descriptor [7] for gen­der classifcation. More recently, Levi et al. [22] present an effective CNN architecture that yieldsfairly good per­formance in gender classifcation. 2.5 Multi-task learning Multi-task learning aims to solve multiple classifcation tasks at the same time by learning them jointly, while ex­ploiting the commonalities and differences across the tasks. Recently, Kaiser et al. [19] propose a big model to learn simultaneously many tasks in nature language processing and computer vision and achieve promising results. Rothe et al. [28] propose a multi-task learning model to jointly learn age and gender classifcation from images. Zhang et al. [2] propose a cascaded architecture with three stages of carefully designed deep convolutional networks to jointly detectfaces and predict landmark locations. Ranjan et al. [27] introducea multi-task learning framework calledhy­perface forface detection, landmark localization, pose es­timation, and gender recognition. Nevertheless, thehyper-face is only trained from a unique source of data with full annotations for all tasks. 3 Our proposed frameworks 3.1 Overall architecture In this work, we propose effective deep CNNs that can learn joint representations from multiple data sources to solve different tasks at the same time. The merged dataset (Fig. 1) is fed into a block called “CNN Shared Network", whichcanbe designedbyusingan arbitraryCNN architec­ture such asVGG [33], ResNet [13] and so on. The moti­vation of the CNN Shared Network is to help the networks learn the shared features from multiple datasets across dif­ferent tasks. It is thought that the features learned in the shared block can generalize better and make more accurate predictions than a single-task model. Moreover, thanks to joint representation learning, the tasks with less data can largely beneft from other tasks with more data. After the shared block, each network is separated into three branches associated with three different tasks. Each branch learns task-specifc features and has its own loss function corresponding to each task. 3.2 Multi-task BKNet Our frst multi-task deep learning framework called Multi-task BKNet has been previously described in [29] (Fig. 3), which is based on the BKNet architecture [30, 31]. We construct the CNN shared networkbyeliminating three last fully-connected layers of BKNet (Fig. 2). CNN Shared Network. In this part, we use four con-volutional (conv) blocks. The frst conv block includes two convlayers with 32 neurons 3×3 with the stride1, followed by a max pooling layer 2 × 2 with the stride 2. The second conv block includes two conv layers with 64 neurons 3 × 3 with the stride 1, followed by a max pooling layer 2 × 2 with the stride 2. The third conv block includes two conv layers with 128 neurons 3 × 3 with the stride 1, followed by a max pooling layer 2 × 2 with the stride 2. Finally, the last conv block includes three conv layers with 256 neu­rons 3 × 3 with the stride 1, followed by a max pooling layer 2 × 2 with the stride 2. Each conv layer is followed by a Batch normalization layer [17] and a ReLU (Recti­fed Linear Unit) activation function [24]. 
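For concreteness, the shared trunk just described can be sketched as follows. This is a minimal PyTorch-style sketch; the framework choice, the class and helper names, and the assumption of 48 × 48 grayscale inputs (matching the pre-processing of Section 3.5) are ours for illustration and are not taken from the published implementation.

```python
# Minimal sketch of the CNN shared network described above: four conv blocks
# of 32/64/128/256 filters, each 3x3 conv followed by batch normalization and
# ReLU, with 2x2 max pooling (stride 2) after each block.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SharedCNN(nn.Module):
    """Shared trunk that learns joint features for all three tasks."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_bn_relu(1, 32),   conv_bn_relu(32, 32),   nn.MaxPool2d(2, 2),
            conv_bn_relu(32, 64),  conv_bn_relu(64, 64),   nn.MaxPool2d(2, 2),
            conv_bn_relu(64, 128), conv_bn_relu(128, 128), nn.MaxPool2d(2, 2),
            conv_bn_relu(128, 256), conv_bn_relu(256, 256), conv_bn_relu(256, 256),
            nn.MaxPool2d(2, 2),
        )

    def forward(self, x):        # x: (batch, 1, 48, 48) grayscale face crops
        return self.blocks(x)    # output: (batch, 256, 3, 3) shared features
```

The task-specific branches described next are attached on top of this shared feature map.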
The Batch normalization layer reduces the internal covariate shift and hence allows us to use a higher learning rate with the SGD algorithm, accelerating the training process.

Branch Network. After the CNN shared network, we split the network into three branches corresponding to the separate tasks, i.e., smile detection, emotion recognition and gender classification. While the CNN shared network learns joint representations across the three tasks from multiple datasets, each branch learns individual features corresponding to its specific task.

Each branch consists of two fully connected layers with 256 neurons and a final fully connected layer with C neurons, where C is the number of classes in the task (C = 2 for the smile detection and gender classification branches, and C = 7 for the emotion recognition branch). Note that, after the last fully connected layer, we can either use an additional softmax layer as a classifier or not, depending on which loss function is used; these loss functions are described in detail in the next section. As in the CNN shared network, each fully connected layer in all branches (except the last one) is followed by a Batch Normalization layer and ReLU. Dropout [34] is also used in all fully connected layers to reduce overfitting.

3.3 Multi-task ResNet

ResNet [13] is known as one of the most efficient CNN architectures so far. In order to enhance the information flow between layers, ResNet uses shortcut connections. The original variant of ResNet was proposed by He et al. in [13] with different numbers of hidden layers: ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152. He et al. then introduced an improved variant (called ResNet_v2) in [14], showing that the pre-activation order "batch normalization - ReLU - conv" is consistently better than the post-activation order "conv - batch normalization - ReLU".

Inspired by the design concept of ResNet_v2, we propose a multi-task ResNet framework to jointly learn the three tasks: smile detection, emotion recognition and gender classification. Since the amount of facial data is not large, we choose ResNet-50 (with bottleneck layers) as the base architecture for our multi-task ResNet framework. In the original ResNet_v2-50 architecture there are 4 residual blocks, each consisting of sub-sampling blocks and identity blocks. The architectures of identity blocks and sub-sampling blocks are shown in Fig. 4a and Fig. 4b. For both kinds of blocks, we use the bottleneck architecture with base depth m, which consists of three conv layers: a 1 × 1 conv layer with m filters, followed by a 3 × 3 conv layer with m filters and a 1 × 1 conv layer with 4m filters. The identity blocks and sub-sampling blocks differ in the stride of the second conv layer and in the shortcut connection: in sub-sampling blocks we use a conv layer with stride 2 instead of the stride 1 used in identity blocks. The first residual block of ResNet-50 contains only 3 identity blocks and has no sub-sampling block. The next three residual blocks have a sub-sampling block at the top, followed by 3, 5 and 2 identity blocks, respectively.

Based on the aforementioned ResNet_v2-50 architecture, we propose two versions of the multi-task ResNet framework. In the first version, abbreviated as Multi-task ResNet ver1, we use all 4 residual blocks to build the CNN shared network that learns joint representations for the three tasks.
Like in the Multi-task BKNet, for each task in the branch network we use two fully connected layers with 256 neurons combined with a softmax classifier. Fig. 5a illustrates the architecture of Multi-task ResNet ver1. In the second version, abbreviated as Multi-task ResNet ver2, we use only the first three residual blocks to build the CNN shared network. For each task in the branch network, we use a separate residual block combined with a global average pooling layer and a softmax classifier. Fig. 5b illustrates the architecture of Multi-task ResNet ver2.

3.4 Multi-source multi-task training

In this paper, we propose effective deep networks that can learn to perform multiple tasks from different data sources. All data sources are mixed together to form a large common training set (Fig. 1). In general, each sample in the mixed training set is related to only some of the tasks. Suppose that:

– T is the number of tasks (T = 3 in this paper);
– L_t is the individual loss corresponding to the t-th task, t = 1, 2, ..., T;
– N is the number of samples from all training datasets;
– C_t is the number of classes of the t-th task (C_1 = C_3 = 2 for the smile detection and gender classification tasks, C_2 = 7 for the emotion recognition task);
– s_i^t is the vector of class scores of the i-th sample in the t-th task;
– l_i^t is the correct class label of the i-th sample in the t-th task;
– y_i^t is the one-hot encoding of the correct class label of the i-th sample in the t-th task (y_i^t(l_i^t) = 1);
– \hat{y}_i^t is the probability distribution over the classes of the i-th sample in the t-th task, obtained by applying the softmax function to s_i^t;
– δ_i^t ∈ {0, 1} is the sample-type indicator (δ_i^t = 1 if the i-th sample is related to the t-th task, and δ_i^t = 0 otherwise).

Figure 1: Merged dataset.

Figure 2: The CNN shared network in Multi-task BKNet is the top part (marked by red lines) of the BKNet architecture [30], excluding the last three fully-connected layers.

Figure 3: Our proposed Multi-task BKNet.

Figure 4: The architectures of identity blocks (a) and sub-sampling blocks (b) in our Multi-task ResNet framework.

Note that if the i-th sample is not related to the t-th task, its true label does not exist, and we can ignore l_i^t and y_i^t. To keep the formulas well defined in this case, we can set them to arbitrary values, for instance l_i^t = 0 and y_i^t equal to the zero vector.

In this paper, we try two kinds of loss: softmax cross-entropy and multi-class SVM loss.

The cross-entropy loss requires a softmax layer after the last fully-connected layer of each branch. The cross-entropy loss L_t corresponding to the t-th task is defined as

L_t = -\frac{1}{N} \sum_{i=1}^{N} \delta_i^t \sum_{j=1}^{C_t} y_i^t(j) \log \hat{y}_i^t(j),    (1)

where y_i^t(j) ∈ {0, 1} indicates whether j is the correct label of the i-th sample, and \hat{y}_i^t(j) ∈ [0, 1] is the predicted probability that j is the correct label of the i-th sample.

The multi-class SVM loss is used when the last fully connected layer of a task-specific branch is not followed by any activation function. The multi-class SVM loss corresponding to the t-th task is defined as

L_t = \frac{1}{N} \sum_{i=1}^{N} \delta_i^t \sum_{j=1,\, j \neq l_i^t}^{C_t} \max\big(0,\, s_i^t(j) - s_i^t(l_i^t) + 1\big)^2,    (2)

where s_i^t(j) denotes the score of class j for the i-th sample and s_i^t(l_i^t) the score of its true label l_i^t.

The total loss of the network is computed as a weighted sum of the three individual losses. In addition, we add an L2 weight decay term over all network weights W to reduce overfitting. The overall loss is defined as

L_{total} = \sum_{t=1}^{T} \mu_t L_t + \lambda \lVert W \rVert_2^2,    (3)

where μ_t is the importance level of the t-th task in the overall loss and λ is the weight decay coefficient. We train the network end-to-end via the standard back-propagation algorithm.
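To make Eqs. (1)-(3) and the role of the sample-type indicators δ_i^t concrete, the following is a minimal PyTorch-style sketch of the training objective. The function names, the assumption that every sample is passed through every branch (with δ masking out losses of unrelated tasks), and the explicit weight-decay term are our illustrative choices, not the authors' released code.

```python
# Hedged sketch of the multi-source multi-task objective (Eqs. (1)-(3)).
# For every task t the batch carries a score tensor, integer labels, and a
# 0/1 mask delta marking which samples actually belong to that task.
import torch
import torch.nn.functional as F

def masked_cross_entropy(scores, labels, delta):
    # Eq. (1): softmax cross-entropy counted only where delta == 1,
    # averaged over the full batch size N.
    per_sample = F.cross_entropy(scores, labels, reduction="none")
    return (delta * per_sample).sum() / scores.size(0)

def masked_multiclass_svm(scores, labels, delta):
    # Eq. (2): squared-hinge multi-class SVM loss with margin 1.
    true_scores = scores.gather(1, labels.unsqueeze(1))            # s_i(l_i)
    margins = torch.clamp(scores - true_scores + 1.0, min=0.0) ** 2
    margins.scatter_(1, labels.unsqueeze(1), 0.0)                  # skip j = l_i
    return (delta * margins.sum(dim=1)).sum() / scores.size(0)

def total_loss(task_outputs, mu, model, weight_decay):
    # Eq. (3): weighted sum of the per-task losses plus L2 weight decay
    # (taken over all parameters here for simplicity).
    loss = 0.0
    for t, (scores, labels, delta) in enumerate(task_outputs):
        loss = loss + mu[t] * masked_cross_entropy(scores, labels, delta)
    l2 = sum((w ** 2).sum() for w in model.parameters())
    return loss + weight_decay * l2
```

With μ_1 = μ_2 = μ_3 = 1 this reduces to the unweighted sum used in the experiments; when the SVM variant of Eq. (2) is chosen, masked_multiclass_svm can be substituted for the cross-entropy term of the corresponding task.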
3.5 Data pre-processing

All the images in the datasets that we use later are portraits. Nevertheless, our networks work with facial regions only, so we perform data pre-processing to crop the faces from the original images. Here we use Multi-task Cascaded Convolutional Neural Networks (MTCNN) [2] to detect faces in each image. Fig. 6 shows some examples of using MTCNN for cropping faces. After that, the cropped images are converted to grayscale and resized to 48 × 48 pixels.

Figure 5: Our proposed Multi-task ResNet framework. The notation "Identity block, m" means the identity block with base depth m.

Figure 6: MTCNN for face detection. The top row shows original images; the bottom row shows the corresponding faces cropped with MTCNN.

3.6 Data augmentation

Due to the small number of samples in the datasets, we use data augmentation techniques to generate more data for the training phase. These techniques help us to reduce overfitting and, hence, to learn more robust networks. We use the following three popular augmentations:
– Random crop: we add margins to each image and then crop a random area of that image with the same size as the original image;
– Random horizontal flip of an image;
– Random rotation of an image by a random angle from -15° to 15°; the space around the rotated image is then filled with black.
In practice, we find that applying these augmentation techniques greatly improves the performance of the model.

4 Experiments and evaluation

4.1 Datasets

4.1.1 GENKI-4K dataset

GENKI-4K is a well-known dataset used for the smile detection task. It includes 4000 labelled images of human faces of different ages and races; 2162 images are labeled as smile and 1838 as non-smile. The images are taken from the internet in different real-world contexts (unlike other face datasets, which are often captured in the same scene), which makes detection more challenging. However, some images in the dataset are unclear (it is uncertain whether the subject is smiling or not). In some previous works, such unclear images were eliminated during the training and testing phases. Keeping wrongly labelled samples intuitively makes the model more likely to be confused during training, and in the testing phase such samples may considerably reduce the overall accuracy when the model makes correct predictions but the labels disagree. Despite that fact, in this work we retain all the images of the original dataset in both phases. Fig. 7 shows some examples from the GENKI-4K dataset.

Figure 7: Some samples in the GENKI-4K dataset. The top two rows are examples of smile faces and the bottom two rows are examples of non-smile faces.

4.1.2 FERC-2013 dataset

The FERC-2013 dataset is provided in the Kaggle facial expression recognition competition. It consists of 35,887 gray images of 48 × 48 resolution, which Kaggle has divided into 28,709 training images, 3589 public test images and 3589 private test images. Each image contains a human face that is not posed (in the wild) and is labeled with one of seven emotions: angry, disgust, fear, happy, sad, surprise and neutral. Some images of the FERC-2013 dataset are shown in Fig. 8.
4.1.3 IMDB andWiki dataset In this work, we use IMDB and Wiki datasets as data sources for gender classifcation task. The IMDB datasetisa largeface dataset that includes data from celebrities. The authors take the list of the most popular 100,000 actors as listed on the IMDB web­site and (automatically) crawl from their profles date of Figure 8: Some samples in the FERC-2013 dataset. birth, name, gender and all images related to that per­son. The IMDB dataset contains about 470.000 images. In this paper, we only use 170.000 images from IMBD. The Wiki dataset also includes datafrom celebrities, which are crawled data fromWikipedia. TheWiki dataset contains about 62.000 images and in this work we will use about 34.000 images from this dataset. Fig.9shows some sam­ples from IMDB andWiki datasets. Figure9: Some samplesin the IMDB andWiki datasets. 4.2 Implementation detail In the experiments, we use GENKI-4K dataset for smile detection, FERC-2013 for emotion recognition. We sepa­ratelyuseoneofthetwoIMDBandWiki datasetsforgen­der classifcation task. Our experiments are conducted using Python programing-language on computers with the follow­ing specifcations: Intel Xeon E5-2650 v2 Eight-Core Processor 2.6GHz 8.0GT/s 20MB, Ubuntu Operating System 14.04 64 bit, 32GB RAM, GPU NVIDIA TITAN X12GB. Preparing data: Firstly, we merge three datasets (GENKI-4K, FERC-2013, gender dataset IMDB/Wiki) to makealarge dataset.Wethen createamarkervectortode-fne sample type indicators .t .We alwayskeep the num- i ber of training data for each task equally to help the learn­ing process stability. For example, if we train our model with two datasets: datasetAwith 3000 samples, datasetB with 30000 samples, we will duplicate datasetA 10 times to make a big dataset with total 60000 samples. In our work, we divide each dataset into training set and testing set.With GENKI-4K dataset, we use 3000 samples for training and 1000 samples for testing.With FERC-2013 dataset we use data split as providedby Kaggle.WithWiki dataset, we use 30000 samplesfor training and about 4200 samples for testing. With IMDB dataset, we use 150000 samples for training and about 20000 samples for testing. Training phase: With Multi-task BKNet architecture, our model is trained end-to-end by using SGD algorithm with momentum 0.9. We set the batch size equal to 128. We initialize all weights usinga Gaussian distribution with zero mean and standard deviation 0.01. The L2 weight de­cay is . =0.01. All the tasks have the same importance level µ1 = µ2 = µ3 =1. The dropout rate for all fully connected layers is set to 0.5. Moreover, we apply an ex­ponential decay function to decay the learning rate through time. The learning rate at step k is calculated as follows: curLr = initLr * decayRatem/decayStep , (4) where curLr is the learning rate at step m; initLr is the initialization learning rate at the beginning of training phase; decayStep is the number of steps when the learning rate decayed. In our experiment, we set initLr =0.01, decayRate = 0.8 and decayStep = 10000. We train our Multi-task BKNet model in 250 epochs. Similar to Multi-task BKNet, we train our Multi-task ResNet end-to-end by using SGD algorithm with momen­tum 0.9. We set the batch size equal to 128. We initial-izeall weightsusingvariance scaling initializer(He initial­izer). The L2 weight decay is 10-4. All the tasks have the same important level µ1 = µ2 = µ3 =1. We train the Multi-task ResNet ver1 in 100 epochs and train the Multi-task ResNet ver2 in 80 epochs. 
The initial learning rate is 0.05 and is then decreased by a factor of 10 whenever the training loss stops improving.

Testing phase: In the testing phase, our models are evaluated with k-fold cross-validation. This method splits the original data into k parts of the same size; the evaluation is performed in loops, where each loop selects k - 1 parts as training data and uses the remaining part for testing. For the convenience of comparing different methods, we use 4-fold cross-validation, as in previous works, and report the average accuracy and the standard deviation over the 4 iterations. Moreover, we test our models with the two different loss functions mentioned above.

Furthermore, we combine different checkpoints obtained during the training phase to infer test samples. In this paper, we keep the 10 last checkpoints, corresponding to the 10 last training epochs, for inference.

4.3 Experimental results

4.3.1 Multi-task BKNet

In this work, we set up two experiment cases. First, we train our model with the GENKI-4K, FERC-2013 and Wiki datasets. Second, we train our model with the GENKI-4K, FERC-2013 and IMDB datasets. Table 1 shows our experiment setup.

We report our results and compare them with previous methods in Table 2. As we can see, using the cross-entropy loss gives better results than using the SVM loss in all cases.

In the smile detection task, the best accuracy we achieve is 96.23 ± 0.58%, when we train our model with the GENKI-4K, FERC-2013 and IMDB datasets. In all experiment cases, we achieve better results than previous state-of-the-art methods. In particular, the Multi-task BKNet clearly outperforms the single-task BKNet [30]. This shows that the smile detection task largely benefits from the other tasks thanks to the shared commonalities between the data.

In the emotion recognition task, the best accuracy we achieve is 71.03 ± 0.11% on the public test set and 72.18 ± 0.23% on the private test set. This result considerably outperforms all previous methods.

In the gender classification task, to the best of our knowledge, there are no previous results on the Wiki and IMDB datasets. In this paper, we apply the single-task BKNet model [30] and achieve accuracies of 95.82 ± 0.44% and 91.17 ± 0.27% on the Wiki and IMDB datasets, respectively. The best accuracy we get on Wiki is 96.33 ± 0.16%, when we train our Multi-task BKNet model on Wiki; the best accuracy we get on IMDB is 92.20 ± 0.11%, when we train our model on IMDB. We also report the test accuracy on IMDB when the model is trained on Wiki, and the test accuracy on Wiki when the model is trained on IMDB.

In all tasks, the Multi-task BKNet yields results comparable to, and in many cases better than, the single-task BKNet. Furthermore, it should be emphasized that the multi-task network effectively solves all three tasks using only one common network instead of three separate ones, which would require approximately three times more memory and computation.

4.3.2 Multi-task ResNet

Based on the experimental results of the Multi-task BKNet, we choose the best configuration, B4 in Table 1, to evaluate our Multi-task ResNet frameworks.

Table 1: Experiment setup

Config | Datasets | Loss function | Use ensemble?
Config A1 | GENKI-4K, FERC-2013, IMDB | SVM loss | No
Config A2 | GENKI-4K, FERC-2013, IMDB | Cross-entropy loss | No
Config A3 | GENKI-4K, FERC-2013, IMDB | SVM loss | Yes
Config A4 | GENKI-4K, FERC-2013, IMDB | Cross-entropy loss | Yes
Config B1 | GENKI-4K, FERC-2013, Wiki | SVM loss | No
Config B2 | GENKI-4K, FERC-2013, Wiki | Cross-entropy loss | No
Config B3 | GENKI-4K, FERC-2013, Wiki | SVM loss | Yes
Config B4 | GENKI-4K, FERC-2013, Wiki | Cross-entropy loss | Yes
Table 2: Accuracy comparison (%) on four datasets

Method | GENKI-4K | FERC-2013 public test | FERC-2013 private test | Wiki | IMDB
Chen et al. [6] | 91.8 ± 0.95 | - | - | - | -
CNN Basic [42] | 93.6 ± 0.47 | - | - | - | -
CNN 2-Loss [42] | 94.6 ± 0.29 | - | - | - | -
Single-task BKNet + Softmax [30] | 95.08 ± 0.29 | - | - | 95.82 ± 0.44* | 91.16 ± 0.27*
CNN (team Maxim Milakov, rank 3 on Kaggle) | - | 68.2 | 68.8 | - | -
CNN (team Unsupervised, rank 2 on Kaggle) | - | 69.1 | 69.3 | - | -
CNN + SVM loss (team RBM) [36] | - | 69.4 | 71.2 | - | -
Single-task BKNet + SVM loss [31] | - | 71.0 | 71.9 | - | -
Our Multi-task BKNet (Config A1) | 95.25 ± 0.43 | 68.10 ± 0.14 | 69.10 ± 0.57 | 93.33 ± 0.19 | 89.60 ± 0.22
Our Multi-task BKNet (Config A2) | 95.56 ± 0.66 | 68.47 ± 0.33 | 69.40 ± 0.21 | 93.67 ± 0.26 | 90.50 ± 0.24
Our Multi-task BKNet (Config A3) | 95.60 ± 0.41 | 70.43 ± 0.19 | 71.90 ± 0.36 | 93.70 ± 0.37 | 91.33 ± 0.42
Our Multi-task BKNet (Config A4) | 96.23 ± 0.58 | 70.15 ± 0.19 | 71.62 ± 0.39 | 94.00 ± 0.24 | 92.20 ± 0.11
Our Multi-task BKNet (Config B1) | 95.25 ± 0.44 | 68.60 ± 0.27 | 69.28 ± 0.41 | 95.25 ± 0.15 | 88.18 ± 0.26
Our Multi-task BKNet (Config B2) | 95.13 ± 0.20 | 69.12 ± 0.18 | 69.40 ± 0.22 | 95.75 ± 0.18 | 88.68 ± 0.15
Our Multi-task BKNet (Config B3) | 95.52 ± 0.37 | 70.63 ± 0.11 | 71.78 ± 0.08 | 95.95 ± 0.15 | 88.83 ± 0.18
Our Multi-task BKNet (Config B4) | 95.70 ± 0.25 | 71.03 ± 0.11 | 72.18 ± 0.23 | 96.33 ± 0.16 | 89.34 ± 0.15
Our Multi-task ResNet ver1 (Config B4) | 95.55 ± 0.28 | 70.09 ± 0.13 | 71.55 ± 0.19 | 96.03 ± 0.22 | 89.01 ± 0.18
Our Multi-task ResNet ver2 (Config B4) | 95.30 ± 0.34 | 69.33 ± 0.31 | 71.27 ± 0.11 | 95.99 ± 0.14 | 88.88 ± 0.07

The results of our Multi-task ResNet are also shown in Table 2. As one can see, the first version yields better results than the second version in all three tasks.

In the smile detection task, the first version of Multi-task ResNet achieves 95.55 ± 0.28% accuracy, while the second version achieves 95.30 ± 0.34%. With the same configuration B4, our Multi-task BKNet model achieves 95.70 ± 0.25% accuracy, which is slightly better than Multi-task ResNet.

In the emotion recognition task, the accuracy of the first version of Multi-task ResNet is 70.09 ± 0.13% on the public test set and 71.55 ± 0.19% on the private test set. The accuracy of the second version is a little lower, with 69.33 ± 0.31% and 71.27 ± 0.11% on the public and private test sets, respectively. In this task, both versions of Multi-task ResNet clearly lose to Multi-task BKNet, which obtains approximately 1% higher accuracy on each test set.

In the gender classification task, both variants of Multi-task ResNet yield fairly good results, which compete with those of the Multi-task BKNet model. The first variant achieves accuracies of 96.03 ± 0.22% and 89.01 ± 0.18% on the Wiki and IMDB datasets, respectively. The second variant achieves 95.99 ± 0.14% on Wiki and 88.88 ± 0.07% on IMDB.

The experimental results show that the Multi-task ResNet is slightly worse than the Multi-task BKNet in all tasks. The reason could be that ResNet, with its rather deep architecture and fairly large number of parameters, is over-complex with respect to the mixed training data across the three tasks and therefore tends to overfit.
Meanwhile, BKNet is quite smaller than ResNet, and is capable to ft the data better. 4.3.3 Speed performance comparison between different frameworks In Table 3 and Table 4, we show the inference time and training time of three frameworks: Multi-task BKNet, Multi-task ResNet ver1 and Multi-task ResNet ver2 with ConfgB4 (fromTable1). As one can see, the Multi-task ResNet ver2 acquires thefastest convergence. Despitea little longerin training time, Multi-task BKNet is signifcantlyfaster in inference in comparison with both versions of Multi-task ResNet. Thefast inference with high accuracymake the Multi-task BKNet well suitable for real-time applications. Table 3: Comparison of inference time between different frameworks Framework Inference time per image (sec) Multi-task BKNet 0.02 Multi-task ResNet ver1 0.065 Multi-task ResNet ver2 0.071 Figure 11: Some results of our Multi-task BKNet frame­work. The blue box corresponds to females and the red box corresponds to males. 5 Conclusion In this paper, we propose effective multi-souce multi-task deep learning frameworks to jointly learn threefacial analysis tasks including smile detection, emotion recogni­tion and gender classifcation. The extensive experiments in well-known GENKI-4K, FERC-2013, Wiki, IMDB datasets show that our frameworks achieve superior accu­racy over recent state-of-the-art methods in all tasks. We also showthat the smile detection task with fewdata largely beneft from the two other tasks with richer data. In the future, we would like to exploit some new auxil­iary losses to regulate the model learning process in order to improve the performance accuracyof neural networks in various computer vision tasks. 6 Acknowledgments This researchis fundedbyHanoiUniversityof Scienceand Technology under grant number T2016-LN-08. References [1] Challengesin respresentation learning:Facialexpres­sion recognition challenge, 2013. [2] Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Pro­cessing Letters, 23(10):1499–1503, 2016. https: //doi.org/10.1109/lsp.2016.2603342. [3]T. Ahonen,A. Hadid,andM. Pietikäinen.Face recog­nition with local binary patterns. Computer vision-eccv 2004, pages 469–481, 2004. [4] L. An, S. Yang, and B. Bhanu. Effcient smile de­tection by extreme learning machine. Neurocom­puting, 149:354–363, 2015. https://doi.org/ 10.1016/j.neucom.2014.04.072. [5] S. Baluja, H. A. Rowley, et al. Boosting sex identi­fcation performance. InternationalJournal of com­puter vision,71(1):111–119, 2007. https://doi. org/10.1007/s11263-006-8910-9. [6] J. Chen, Q. Ou, Z. Chi, and H. Fu. Smile de­tection in the wild with deep convolutional neu­ral networks. Machine vision and applications, 28(1-2):173–183, 2017. https://doi.org/10. 1007/s00138-016-0817-z. [7] J. Chen, S. Shan, C. He, G. Zhao, M. Pietikainen, X. Chen, and W. Gao. Wld: A robust local image descriptor. IEEE transactions on pattern analysis and machine intelligence, 32(9):1705–1720, 2010. https://doi.org/10.1109/tpami. 2009.155. Table 4: Comparison of training time between different frameworks Framework Number of epochs Training time per epoch (min) Total training time (min) Multi-task BKNet 250 3.42 854 Multi-task ResNet ver1 100 8.12 817 Multi-task ResNet ver2 80 8.67 693 [8]T.F. Cootes,C.J.Taylor,etal. Statistical modelsof appearance for computer vision, 2004. [9] C. Cortes and V. Vapnik. Support vector machine. Machine learning, 20(3):273–297, 1995. [10] O. Déniz, G. Bueno, J. Salido, and F. De la Torre. 
Face recognition using histograms of oriented gra­dients. Pattern Recognition Letters, 32(12):1598– 1603, 2011. https://doi.org/10.1016/j. patrec.2011.01.004. [11] P. Ekman and E. L. Rosenberg. What the face reveals: Basic and applied studies of sponta­neous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 1997. https://doi.org/10.1093/acprof: oso/9780195179644.001.0001. [12] B. A. Golomb, D.T. Lawrence, andT. J. Sejnowski. Sexnet: Aneural network identifes sex from human faces. In NIPS, volume 1, page 2, 1990. [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep resid­ual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pat­tern recognition, pages 770–778, 2016. https: //doi.org/10.1109/cvpr.2016.90. [14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016. https://doi.org/10.1007/ 978-3-319-46493-0_38. [15] X. He and P. Niyogi. Locality preserving projec­tions. In Advances in neural information processing systems, pages 153–160, 2004. [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional net­works. In CVPR, volume 1, page 3, 2017. https: //doi.org/10.1109/cvpr.2017.243. [17] S. Ioffe and C. Szegedy. Batch normalization: Ac­celerating deep network training by reducing internal covariate shift. In International Conference on Ma­chine Learning, pages 448–456, 2015. [18] V. Jain and J. L. Crowley. Smile detection using multi-scale gaussian derivatives. In 12th WSEAS International Conference on Signal Process­ing, Robotics andAutomation, 2013. [19] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. One model to learn them all. arXiv preprint arXiv:1706.05137, 2017. [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Im­agenet classifcation with deep convolutional neural networks. In Advances in neural information process­ing systems, pages 1097–1105, 2012. [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recog­nition. Proceedings of the IEEE, 86(11):2278– 2324, 1998. https://doi.org/10.1109/5. 726791. [22] G. Levi and T. Hassner. Age and gender classif­cation using convolutional neural networks. In Pro-ceedingsofthe IEEE Conferenceon ComputerVision and Pattern Recognition Workshops, pages 34–42, 2015. https://doi.org/10.1109/cvprw. 2015.7301352. [23] M. Liu, S. Li, S. Shan, and X. Chen. Enhancing ex­pression recognition in the wild with unlabeled refer­ence data. In Asian Conference on ComputerVision, pages 577–588. Springer, 2012. https://doi. org/10.1007/978-3-642-37444-9_45. [24] V. Nair and G. E. Hinton. Rectifed linear units im­prove restricted boltzmann machines. In Proceed­ings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010. [25] V. Ojansivu and J. Heikkilä. Blur insensitive texture classifcation using local phase quantization. In Inter­national conference on image and signal processing, pages 236–243. Springer, 2008. https://doi. org/10.1007/978-3-540-69905-7_27. [26] A. J. O’toole, T. Vetter, N. F. Troje, and H. H. Bthoff. Sex classifcation is better with three-dimensional head structure than with image inten­sity information. Perception, 26(1):75–84, 1997. https://doi.org/10.1068/p260075. [27] R. Ranjan, V. M. Patel, and R. Chellappa. 
Hyper-Face:Adeep multi-task learning framework forface detection, landmark localization, pose estimation, and gender recognition. IEEETransactions onPat-tern Analysis and Machine Intelligence, pages 1–1, 2017. https://doi.org/10.1109/tpami. 2017.2781233. [28] R. Rothe, R.Timofte, and L.Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Confer­ence on Computer Vision Workshops, pages 10–15, 2015. https://doi.org/10.1109/iccvw. 2015.41. [29] D. V. Sang, L. T. B. Cuong, and V. V. Thieu. Multi-task learning for smile detection, emotion recognition and gender classifcation. In Pro­ceedings of the Eighth International Symposium on Information and Communication Technology, Nha Trang City, Viet Nam, December 7-8, 2017, pages 340–347, 2017. https://doi.org/10.1145/ 3155133.3155207. [30] D.V. Sang,L.T.B. Cuong, andD.P. Thuan. Facial smile detection using convolutional neural networks. In The 9th International Conference on Knowledge and Systems Engineering (KSE 2017), pages 138– 143, 2017. https://doi.org/10.1109/kse. 2017.8119448. [31]D.V.Sang,N.V.Dat,andD.P. Thuan. Facialex­pression recognition using deep convolutional neu­ral networks. In The 9th International Conference on Knowledge and Systems Engineering (KSE 2017), pages 144–149, 2017. https://doi.org/10. 1109/kse.2017.8119447. [32] C. Shan. Smile detection by boosting pixel dif­ferences. IEEE transactions on image processing, 21(1):431–436, 2012. https://doi.org/10. 1109/tip.2011.2161587. [33] K. Simonyan and A. Zisserman. Very deep convo­lutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [34] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a sim­ple way to prevent neural networks from overftting. Journal of machine learning research, 15(1):1929– 1958, 2014. [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Ra-binovich. Going deeper with convolutions. In Pro­ceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015. https: //doi.org/10.1109/cvpr.2015.7298594. [36] Y. Tang. Deep learning using support vector ma­chines. CoRR, abs/1306.0239, 2, 2013. [37] I. Ullah, M. Hussain, G. Muhammad, H. Aboalsamh, G. Bebis, and A. M. Mirza. Gender recognition from face images with local wld descriptor. In Systems, Signals and Image Processing (IWSSIP), 2012 19th International Conference on, pages 417–420. IEEE, 2012. [38] H. Van Kuilenburg, M. Wiering, and M. Den Uyl. A model based method for automatic facial expres­sion recognition. In Proceedings of the 16th Euro­pean Conference on Machine Learning (ECML’05), pages 194–205. Springer, 2005. https://doi. org/10.1007/11564096_22. [39]P.Viola andM. Jones. Fast and robust classifcation using asymmetric adaboost anda detector cascade. In Advances in neural information processing systems, pages 1311–1318, 2002. [40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recog­nition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017. https://doi.org/ 10.1109/cvpr.2017.634. [41] S. Zagoruyko andN.Komodakis.Wide residual net­works. In Procedings of the British Machine Vi­sion Conference 2016. British MachineVision Asso­ciation, 2016. https://doi.org/10.5244/c. 30.87. [42] K. Zhang, Y. Huang, H. Wu, and L. Wang. Fa­cial smile detection based on deep learning fea­tures. 
In Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on, pages 534–538. IEEE, 2015. https://doi.org/10.1109/acpr. 2015.7486560. Alignment-free Sequence Searching over Whole Genomes Using 3D Random Plot of Query DNASequences Da-Young Lee, Hae-SungTak, Han-Ho Kim andHwan-Gue Cho Dept. of Electrical and Computer Engineering, Pusan National University, SouthKorea E-mail: {schematique, tok33, quant1216, hgcho}@pusan.ac.kr Keywords:sequence search, sequence visualization, whole genome, 3D random plot Received: March 29, 2018 Most genomic data studies are based on sequence comparisons and searches, and comparison models based on alignment algorithms are most commonly used. This methodisvery accurate,butitis useful when the query is short in kilobytes, because it requires the quadratic time and space complexity, O(n2) where n is the length of target and query sequences. With the development of Next Generation Sequencing tech­niques, researches on whole genome sequence data of megabyte size are being actively studied, and new comparison and search methods for large-scale sequence data are needed. We propose a new alignment-free sequence comparisonand search methodtoovercomethe limitationsofthe alignment-based model.In this graphical model, the sequence searching problem in DNAstrings can be reduced to fnd some parts of geometric object withinarelatively small-scale geometric space. When comparing similaritybymodifying sequencesof similarlength,wecan confrmthatthe comparisonmodelis appropriatebyaccurately refect­ing the degree of similarity. When searching the query sequence comparison model based on 200MB sized whole genome sequence, using the compressed coordinate information, it was able to search the 10MB sequences in 22s, which is a very reduced time compared to alignment. Although it is not possible to fnd the exact position of the base pair unit as in the alignment result, it is a model that can be used as a preprocessing process to quickly search a whole genome sequence of several hundred megabytes-size. Povzetek: Na podlagi 3D vizualizacije celotnega zaporedja genoma so avtorji pokazali, da je na dolžini poizvedbe možno prilagodljivo hitro iskanje. 1 Introduction Genomic data studies are done through sequence compar­isons, mostly using a model based on an alignment algo-rithm.Forexample, Basic Local Alignment SearchTool (BLAST)[1] is the most common method to search for se­quences in a database. It divides the query sequence into three characters, fnds the matching region, and gradually widens the region to select candidates for alignment. Al-thoughitisvery useful when searching fora short queryin the whole database, since it is based on alignment, it is dif­fcult to obtain an immediate processing result in the case of a large sequence such as a megabyte-scale chromosome owing to a large increase in computational cost. When uti­lizing the actual BLAST service, it is recommended to re­duce the database search scope when the query size is of the order of megabytes, and it is often time consuming to search and provide results by mail, rather than providing it immediately. In addition, since gene recombination is different from sequence alignment based on conservation of contigu­ity between homologous segments, in order to overcome this problem, alignment-free comparison method such like word-frequencystatistics, a method of calculating distance in space defned by frequency vectors, is also actively underway[2]. 
Such research is also widely used as a pre-flter for processing queries of alignment-based models. In this paper, we propose a geometric-based heuristic technique that enables the rapid comparison and search of sequences in personal computers. In this regard, AMSS[3] is a model that provides shape-based similarity compari­son, assumingthatthetime seriesdataisavector sequence. Instead of focusing on individual points of time series data, the model focuses onvectors and compares similarities be­tween data using cosine similarity. This method is advan­tageous in that it is effective for amplitude and time shift­ing. In this study, we also aimed to reduce the time and space complexity by converting the genetic sequence into a geometric object such as a random plot and performing comparison and search, taking into account that the genetic sequence data is ordered sequence data. Instead of con­sidering a single separate base, as in the alignment algo­rithm, the method compares the vector generated based on the sequence of the predetermined unit only once, and it is possible to signifcantly reduce the time required for com­parison operation by visualizing a sequence search result and presenting the information more intuitively. In addi­tion, the high-speed heuristic search technique can be ap­plied to large amounts of data, and it is possible to specify the necessary precise alignment analysis. Compared to [14], we present an improved similarity computation algorithm that considers input sequences with different lengths. We show the effectiveness of the pro­posed method with experiments on searching for short query sequences on a long sequence. 2 Related work 2.1 Genome SequenceVisualization Most genetic data have a huge volume, and it is diffcult to fnd meaningful patternsinsuchdataowingtotheirregular confguration of the four bases. The visualization of se­quence information and sequence analysis information can help in forming an intuitive understanding of the genomic data and enable the effcient representation of the results. Genome visualization research focuses on twoaspects. The frst is the visualize of a large amount of genetic informa­tion in a short time and a limited space, and the second is the representation of complex information as intuitively as possible. Figure 2: The vector design of ‘H-L curve’[5] (a) and graphical representation for the DNA sequence s =‘ATGGCATGCA’ (b). The‘Worm Curve’[4,6] represents genome information ina limited space,andit assignsabinarycodetoeach base. It is plotted on a Cartesian coordinate system, and its most signifcant biggest advantage is that the curve can represent allthe informationina relativelysmallspace, despitehow little the point intersects with each other. Studies have been actively conducted using a variety of curves to intuitively represent complex information. For example, the ‘Dual-Base Curve’ (DB-Curve)[7] has been designed to visual­ize the features of a genome sequence at a glance. In this curve, the two different bases are confgured as a combina­tion,andatwo-dimensionalvectoris assigned, wherethey componentis assignedasa constant(+1)andthex compo­nents are assigned separately. In this visualized, since the curve is continuous in the positive direction of theyaxis, thereisnopointatwhichitcrosseswith itself. Obtaininga ratio of the x-coordinates of the end points can confrm the relativeexisting ratioofthetwo basesto obtainthe statisti­cal information of the sequence in an intuitive manner. 
In contrast, the ‘H-L curve’[5] is a method of assign-ingatwo-dimensionalvectorforthe four bases witha con­stant x component, and this curve avoids intersection with itself because different y-components are assigned. Since the progress of a DNAsequence matches one-to-one with the ‘H-L Curve,’ it has the advantage that the main differ­ence of each sequence with other sequences can be checked quickly. In addition to visualizing curves, there is a ‘Four-Color Map’[8], which assigns colors to each base and flls ar­eas proportional to the frequency of occurrence with the corresponding color, and ‘Circos’[9, 10], which visualizes the whole genome in a circular track form. ‘Circos’ rep­resents a chromosome as a piece of a circular track, and connects the interactive chromosome tracks with a curve, thereby effectively expressing the internal relation of the whole genome. Although most relational connection vi­sualization methods express only one-to-one associations, ‘Circos’ canexpress many-to-manyassociations as wellby using circular tracks. 2.2 VisualizationToolfor Genome Sequence Figure 3: 3D graphical representation of DNA sequence using Z-axis as time axis[11]. The graphical representation for the sequence ‘ATGGTGCACC’. To compensate for the drawbacks of the sequence align­ment method in terms of processing speed, a heuristic method based on visualization is utilized. By converting a large amount of text information composed of only four kinds of bases, the meaning of which is diffcult to intu­itively grasp, to geometry information, heuristic methods are able to identify the type of data through visual exam­ination to easily fnd patterns that cannot be revealed us­ing computational methods[12]. Furthermore, geometric rules found in the visible results often have a meaningful relationship with genomic analysis in the feld. Heuristic methods are especially useful when utilized for quickly cal­culating similarity or dissimilarity. For example, large-scale genomic sequence information is converted into information on a polygon domain, and the problem of fnding similarity is solved by replacing the comparison of similarity of sequences with the compari­son of image similarity[13]. By setting a direction for each base, the sequence is converted to a random plot in which the polygon area is simplifed with the k-convex hull, and the homology of two random plots is compared. Studies [14, 15] have considered the extended space up to three dimensions in the vector assignment for each base. Conse­quently, a random plot can be visualized on three dimen­sions, and the similarity can be compared by simplifying it to be close to the actual random plot. Since direct comparison is diffcult for a walk-plot ob­ject in three dimensions, a random plot is populated in a certain space around the polygon area, and the orthogonal projectionof this space on each plane (X-Y,Y-Z, and X-Z) is used to compare the degree of similarity using the over­lap area ratio. However, the comparison method based on theoverlapping areahasadrawbackinthatitdoesnottake into account the random plot present in the local area. To overcome this drawback without simplifying the random plot, the shape of the line is maintained while the shortest distance between anypoints of two random plots is calcu­lated for comparing the degree of similarity between two sequences[16]. Previously, an alignment method called ‘Four Line’ involving graphical-domain sequence alignment, rather than string alignment, was proposed[17]. 
By assigning the four bases to different points on theY-axis and connecting the matched points in the sequence to be subjected to alignment in the X-axis to make a visualization of the zigzag curve, the visualization result of the two sequences are compared to conduct alignment. In order to overcome the disadvantages such as loss of information and self-intersection of existing two-dimensional visualization methods, there is a study in which a DNA sequence is three-dimensionally utilized as a time axis[11]. Regardless of the information of the base to the z-axis will always increases, and by assigning vec­tors x,yaxis is increased or decreased for each base. Not only it limited to visualization, to derive the geometrical center of the curve, this time the center of this curve is im­portant information indicating the distribution of each base. In this study,asimilarity comparison modelwasdevisedby assigning vectors to each other in different ways and using the Euclidean distance and angle correlation of the distance tothestartandendpointsofthevectorthrougheight trans­form. As a result, theycould construct the similarity ma­trix, it shown that the similar species such as human and gorilla have high similarity. In this manner, visualization results can be used not only for the intuitive delivery of sequence informationbut also as an analysis target to improve the processing speed and to obtain meaningful results. In this study, by focusing on this point, we convert a whole genome sequence to a walk-plot objectin three-dimensional space,extractavector,and compare and search for the sequence with improved speed. Furthermore, by visualizing a search query sequence to­gether with the random plot of the whole genome sequence, the position and distribution of the obtained similar se­quence can be transferred in an intuitive form. Table 1: Functional Performance of Previous Research Research Plotting space dimension Supports large-scale sequence Global similarity compute Local similarity compute BLAST [1] Compact 2D [4] H-L Curve [5] Bo Liao [11] 3D Random [15] Proposed N/A 2D 2D 3D 3D 3D 4 O 4 4 O O O O X O O O O X X X X O 3 New method using 3D random plot 3.1 Sequence Searching method with 3D Random Plot Structure An overview of our algorithm framework is shown in Fig­ure 4. Generally, all types of biological sequence compar­ison exploit the sequence alignment based on a dynamic programming approach. One popular alignment algorithm is the Needlemann–Wunsch algorithm, which is widely used in molecular biology. There are many variations in sequence alignment, such as global alignment, local align­ment, and semi-global alignment. Though the alignment approach has manyadvantages, it has a critical drawback in that it involves high complexity in terms of execution-time complexity and space complexity. The complexity of the basic alignment algorithm is O(m · n) if the lengths of two input sequences are n and m. If .(n) = .(m), the complexity is quadratic: O(n2). When the size of the in­put sequence is greater than 100 megabytes, this alignment is impractical, because it requires a main memory greater than the order of gigabytes. To overcome these problems, researchers developed heuristic alignment techniques such as BLAST-like tools. Another problem in the alignment algorithm is that it is not easy to defne the score/penalty matrix to meet the manydifferent constraints in biological sequence comparison. 
The basic idea of our approach is that we compute the similarity of two sequences in 'geometric random plot' space, rather than in 'string sequence' space.

Figure 4: Space transform from sequence to 3D geometric shape.

As shown in Figure 4, we first transform the input sequences into random plots in 3D space. Then, we compare or search for a target sequence as a 3D geometric object. The transformed random plot can be visualized on an appropriately sized grid, and a sequence of megabytes in size can be represented by a list of pixels much smaller than the actual number of bp. Thus, our geometric transformation is a type of approximation with visualization. The advantage of this transformation is that the global structure can be shown while the biological noise embedded in the sequence is hidden. The main merit of our approach is that it is useful and efficient for comparing very long sequences. Assume, for instance, that we are asked to find the location of a sequence a few megabytes in length within a whole genome longer than 100 megabytes.

3.2 Vector Allocation for Random Plot

Sequence data are string information composed of {a, g, t, c}; therefore, they must be converted into graphical information for visualization. Previous 2-D visualization methods have visualized genome sequences by assigning a separate base to the positive and negative directions of each axis (x and y). This method has the disadvantage that a large amount of information is lost when bases with opposite vectors are continuously repeated. Furthermore, if the same pattern is continuously repeated, it is impossible to visualize a large volume of data in a limited space. To overcome this disadvantage, [15] used a 3D vector: a vector is assigned to each base, but a combination of two bases constitutes one step of the random plot, and when two bases with opposite vectors are coupled together, the representation uses the z-axis to minimize the lost information. In this study, using the 3D vector allocation model of [15], we compute the vector character of the sequence data and obtain sequence search positions to visualize the results.

Table 2: Vector allocation method for each 2-mer base in a genome sequence in three-dimensional geometric space

2-mer | Vector | 2-mer | Vector
AA | (2, 0, 0) | AG (GA) | (1, 1, 0)
AC (CA) | (1, -1, 0) | AT (TA) | (0, 0, -2)
CC | (0, -2, 0) | CG (GC) | (0, 0, +2)
CT (TC) | (-1, -1, 0) | GG | (0, 2, 0)
GT (TG) | (-1, 1, 0) | TT | (-2, 0, 0)

Table 2 summarizes the vector allocation method for each 2-mer. In Table 2, the base pairs AT and GC are represented on the z axis. The other base pairs are represented as the sum of the two unit vectors of their bases, as given by the WS-curve method. After the vector transition of the DNA genome data, those vectors are visualized in three-dimensional space. The method of visualization is the same as in two-dimensional visualization: the sum of the vector values is computed according to the order of the sequence and the resulting points are connected with a line to provide the final visualization. For the random plot R, the starting point is R(0) = (X_0, Y_0, Z_0) with X_0 = Y_0 = Z_0 = 0, and Unit^{3d}(i) is the vector assigned to the i-th 2-mer. The i-th point R(i) = (X_i, Y_i, Z_i) of the random plot is computed as follows:

R(i) = R(i-1) + Unit^{3d}(i) = \sum_{k=1}^{i} Unit^{3d}(k)    (1)

Figure 5 shows the direction of the random plot for each 2-mer read. Since the first 2-mer read 'AA' lies on the x-axis (+2), it can be confirmed from panel (a) that the walk moves from the origin O in the positive x direction.
Since the next 2-mer read is 'AT', a movement along the z-axis by (-2) can be confirmed. This vector transformation rule was determined empirically so as to discriminate different sequences effectively; as Figure 6 shows, similar sequences are likely to produce similar walk plots.

In this way, the transformed random plot is visualized in an appropriately sized three-dimensional grid. The default grid size of 500 × 500 × 500 was determined empirically as the point at which the trade-off between speed and correctness of comparison is well balanced for the sequences used in the experiments. A short genome sequence can easily be represented in a 500 × 500 × 500 grid, but a large sequence needs space normalization to fit the random plot into the limited space. Once the vectors of the random plot have been calculated, let max_x, max_y and max_z be the distances from the origin O(0, 0, 0) to the points farthest along the X, Y and Z axes, and let V be the view size of the visualization. The normalized i-th point R(i) = (X_i, Y_i, Z_i) can then be expressed as:

Regular(R(i)) = \left( X_i \cdot \frac{V}{max_x},\; Y_i \cdot \frac{V}{max_y},\; Z_i \cdot \frac{V}{max_z} \right)    (2)

Figure 5: Movement of the random plot for each 2-mer read. Panels (a), (b), (c) and (d) show the plots as walks in the X-Y, X-Z and Y-Z planes of three-dimensional space. From O(0, 0, 0), the random plot proceeds according to the base pair assigned to each 2-mer. The red random plot represents movement in the X-Y plane, and the blue random plot represents movement along the Z axis.

This visualization model is very useful for comparing huge whole genomes. Figure 6 shows an advantage of this approach [15]. We have constructed the 3D random plots of two whole genomes, Human chromosome 1 and Chimpanzee chromosome 1. In Figure 6, the red random plot represents the Human and the green one the Chimpanzee. The red random plot extends further in the positive direction of the X and Y axes than the green one. This visualization directly confirms that the two genomes are quite similar and that the Human chromosome contains more 'G' and 'A' bases than the Chimpanzee chromosome.

Figure 6: Visualization result of Human and Chimpanzee chromosome 1. The red plot is constructed from Human chromosome 1 and the green random plot from the whole genome of Chimpanzee (Pan troglodytes) chromosome 1.

3.3 Vector Extraction from Random Plot

For G, a genome sequence consisting of the 4 DNA bases {a, g, t, c}, ranwalk(G) represents the three-dimensional geometric object constructed by our proposed algorithm; ranwalk(G) consists of a list of linked pixels as follows.

Definition 1. ranwalk(G) is the list of pixels P_i = (x_i, y_i, z_i) satisfying |x_i - x_{i+1}| ≤ 1, |y_i - y_{i+1}| ≤ 1 and |z_i - z_{i+1}| ≤ 1, which means that two consecutive pixels P_i and P_{i+1} are adjacent to each other, sharing a common face. We say P_i and P_{i+1} are 'adjacent' if they are within a distance of 1.

Figure 7: A geometric random plot (blue dotted line) and corresponding vectors.
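As a concrete illustration of Table 2 and Eqs. (1)-(2), the following Python sketch builds the random plot of a sequence and normalizes it into a V × V × V view. The dictionary and function names, the non-overlapping reading of 2-mers and the handling of a trailing odd base are our assumptions for illustration, not details fixed by the paper.

```python
# Illustrative sketch (not the authors' code): build the 3D random plot of a
# DNA sequence from the 2-mer vectors of Table 2 (Eq. (1)) and normalize it
# into a view x view x view volume (Eq. (2)). Non-overlapping 2-mers are
# assumed; a trailing single base is simply ignored here.
UNIT_3D = {
    "AA": (2, 0, 0),   "AG": (1, 1, 0),   "GA": (1, 1, 0),
    "AC": (1, -1, 0),  "CA": (1, -1, 0),  "AT": (0, 0, -2), "TA": (0, 0, -2),
    "CC": (0, -2, 0),  "CG": (0, 0, 2),   "GC": (0, 0, 2),
    "CT": (-1, -1, 0), "TC": (-1, -1, 0), "GG": (0, 2, 0),
    "GT": (-1, 1, 0),  "TG": (-1, 1, 0),  "TT": (-2, 0, 0),
}

def random_plot(seq):
    """Cumulative sum of 2-mer vectors, starting from R(0) = (0, 0, 0)."""
    x = y = z = 0
    points = [(x, y, z)]
    for i in range(0, len(seq) - 1, 2):          # non-overlapping 2-mers
        dx, dy, dz = UNIT_3D[seq[i:i + 2].upper()]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points

def normalize(points, view=500):
    """Scale every coordinate so the plot fits a view x view x view grid."""
    max_x = max(abs(p[0]) for p in points) or 1
    max_y = max(abs(p[1]) for p in points) or 1
    max_z = max(abs(p[2]) for p in points) or 1
    return [(px * view / max_x, py * view / max_y, pz * view / max_z)
            for px, py, pz in points]

walk = normalize(random_plot("ATGGCATGCA"))      # small usage example
```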
Now, we explain how to compute the distance between two random plots obtained from two genomes G_a and G_b to be compared. Assume that we have constructed two geometric objects, R_a = ranwalk(G_a) and R_b = ranwalk(G_b). The proposed distance measure, the random plot distance Rdist, is a vector with two components, ΔSpan and ΔDegree, and takes an additional parameter, the depth k. The distance between two random plots R_a and R_b at depth k is defined recursively as follows, where R_{a1} is the first half of R_a, R_{a2} is the last half of R_a, and R_{b1}, R_{b2} are defined in the same way. Thus, R_a = R_{a1} ∘ R_{a2}, where ∘ denotes the geometric concatenation operation.

Definition 2. Rdist(R_a, R_b, k) = Rdist(R_{a1}, R_{b1}, k+1) + Rdist(R_{a2}, R_{b2}, k+1).

Now we explain how to compute Rdist(R_a, R_b, k) at the basic depth level k = 1. In Figure 7, the thick blue dotted curve represents the random plot of a genome sequence. Symbols P_0 (= O) and P_1 denote the first and last pixel of a random plot, respectively, and P_t denotes the first t-percentile pixel; thus P_{0.5} denotes the exact middle pixel in the list of pixels generated by our transformation algorithm.

Figure 8: Two comparison parameters {θ_{A,B}, L_{A,B}}.

For an interval of a random walk, we obtain one parameter: the length of the direction vector (P_0, P_1). If the two random walks to be compared both start at the origin (0, 0, 0), we can obtain the lengths of the two direction vectors of R_a and R_b and compute the angle difference between them. Assume the start and end points of R_a are P_{a0}, P_{a1}, and those of R_b are P_{b0}, P_{b1}. If k = 1, the comparison targets are the vectors P_{a0}P_{a1} and P_{b0}P_{b1}. If k = 2, each walk is divided one step further into two vectors, and both the front and rear vectors are compared; the comparison targets are then P_{a0}P_{a0.5} versus P_{b0}P_{b0.5} and P_{a0.5}P_{a1.0} versus P_{b0.5}P_{b1.0}. If k = 3, applying the same method, 2^k = 8 comparisons are performed. If the length of a divided vector drops below an appropriate length D, the recursion is aborted. In this paper, the threshold D is set to 100 times the unit size, where the unit size is the number of bp per pixel in the visualization. The value of D was determined experimentally: a meaningful comparison was possible when the vector length was at least 100 px.

3.4 Computing Similarity and Search on Random Plot

Rdist refers to the similarity distance between two vectors. Figure 8 shows the two parameters θ_{A,B} and L_{A,B} used by Rdist: θ_{A,B} is the angle between the two vectors, and L_{A,B} is the ratio by which the lengths of the two vectors differ, relative to the longer vector. If the two vectors have the same orientation then θ_{A,B} = 0, and if their lengths are equal then L_{A,B} = 0 (0 ≤ θ_{A,B} ≤ 180, 0 ≤ L_{A,B} ≤ 1).

Algorithm 2 Comparison algorithm
  initialize beg ← 0
  initialize end ← len(R_a)
  initialize O ← (0, 0, 0)
  initialize D ← threshold length of vector
  procedure SIM(beg, end: indices of vector list; R_a, R_b: random plots of G_a, G_b; thresholds θ_s, L_s)
    mid ← (end - beg)/2 + beg
    cnt ← 0
    if end - beg > D then
      cnt += SIM(beg, mid, R_a, R_b)
      cnt += SIM(mid + 1, end, R_a, R_b)
    else
      V_a ← R_a[end] - R_a[beg]
      V_b ← R_b[end] - R_b[beg]
      Len_a ← euclideanDist(O, V_a)
      Len_b ← euclideanDist(O, V_b)
      θ_{a,b} ← acos(dotProduct(V_a, V_b) / (Len_a × Len_b)) × 180/π
      L_{a,b} ← abs(Len_a - Len_b) / max(Len_a, Len_b)
      if θ_{a,b} ≤ θ_s and L_{a,b} ≤ L_s then
        return 1
      else
        return 0
      end if
    end if
    return cnt
  end procedure

To compare and visualize the random plot in a limited space, compression is necessary, as described earlier in formula (2). However, in the case of the reference sequence, two sets of normalized values are maintained in order to compute the overall similarity of the two vectors: one normalized with the values used to process the query sequence, and one normalized with the values computed from the original reference sequence. When comparing against a search query, the normalized values of the query are used, while visualization uses the original normalized values. This is because an accurate comparison is not possible when the normalization values differ due to the size difference between the reference and the query.
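The recursion of Definition 2 and Algorithm 2 can be sketched in Python as follows. The threshold values θ_s, L_s and D, the helper names, and the final similarity score (the ratio of unit-size pieces that satisfy both thresholds, as described above) are illustrative assumptions rather than fixed details of the paper.

```python
# Illustrative sketch of the recursive comparison (Definition 2 / Algorithm 2):
# split both random plots in half until the pieces fall below the length
# threshold D, then compare the direction vectors of corresponding pieces by
# angle (theta) and relative length difference (L).
import math

def vec(points, beg, end):
    (x0, y0, z0), (x1, y1, z1) = points[beg], points[end]
    return (x1 - x0, y1 - y0, z1 - z0)

def sim_count(beg, end, ra, rb, theta_s=15.0, l_s=0.2, D=100):
    """Return (#matching pieces, #pieces) for the interval [beg, end].
    theta_s, l_s and D are illustrative thresholds."""
    if end - beg > D:                              # keep halving (depth k + 1)
        mid = beg + (end - beg) // 2
        m1, n1 = sim_count(beg, mid, ra, rb, theta_s, l_s, D)
        m2, n2 = sim_count(mid + 1, end, ra, rb, theta_s, l_s, D)
        return m1 + m2, n1 + n2
    va, vb = vec(ra, beg, end), vec(rb, beg, end)
    len_a = math.dist((0, 0, 0), va)
    len_b = math.dist((0, 0, 0), vb)
    if len_a == 0.0 or len_b == 0.0:               # degenerate piece
        return 0, 1
    dot = sum(a * b for a, b in zip(va, vb))
    theta = math.degrees(math.acos(max(-1.0, min(1.0, dot / (len_a * len_b)))))
    l_diff = abs(len_a - len_b) / max(len_a, len_b)
    return (1 if theta <= theta_s and l_diff <= l_s else 0), 1

def similarity(ra, rb):
    """Fraction of unit pieces whose vectors agree within both thresholds."""
    end = min(len(ra), len(rb)) - 1
    matched, total = sim_count(0, end, ra, rb)
    return matched / total
```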
After the reference sequence and the query sequence have been normalized according to the normalization value of the query sequence, the depth is extended to a predetermined level k and the comparison proceeds by dividing each random plot into unit-size pieces. All the unit-size vector pieces extracted from the two random plots are compared with Rdist(). The more of these results that fall within the predetermined reference range (θA,B ≤ θs and LA,B ≤ Ls), the higher the degree of similarity. The ratio between the number of unit vectors that meet the conditions and the total number of vectors is the similarity between the two sequences.

3.5 Reference Sequence Slot

If the query is long enough, the sequence information is compressed at an appropriate rate during visualization in a limited space. It is therefore possible to perform the search in the on-memory state by applying the same compression ratio to the reference sequence. However, short sequences, such as LTR sequences, are only kilobytes in size and remain uncompressed in the visualization process. In this case the vector information becomes large, and a query search becomes impossible in the on-memory state. To compensate for this, when the length of the reference sequence differs from that of the query by more than 200 times, the reference sequence is divided into an appropriate number of slots to perform the search. A slot is like a window: by reducing the search range to a multiple of the query length at a given point in time, the method described above can be applied even when a search is required at a low compression ratio in a large sequence.

|Slot(Q, R)| = (|ranwalk(R)| − c0 · |ranwalk(Q)|) / (|ranwalk(Q)| · (c0 − 1))   (3)

Equation (3) gives the number of slots created when a query and a reference sequence are given. Q and R are the query and reference sequences, respectively, and |ranwalk(X)| represents the length of the whole vector information when sequence X is expressed as a random plot. c0 is a control constant, the size of the space in which a vector should be searched when a query vector of a certain size is given. In this paper, c0 is set to around 200.0. Since the query may lie across the point where a slot is divided, the boundaries of adjacent slots overlap by the length of the query vector. Figure 9 shows the vector of the reference sequence divided into slots.

Figure 9: Slot division in the reference sequence vector based on the vector length of the query sequence.

4 Experiments

4.1 Dataset Preparation

Actual biological sequence data were used for the searching experiment, and artificial data were used to validate the similarity comparison model. The biological sequences are Human chromosome 1 (246 MB in size) and sequences of 1M-10M size extracted from chromosome 1. The artificial sequence data were obtained by extracting sequences of 1-10 MB length from the Human chromosome 1 sequence at random locations and inserting noise at a predetermined ratio: bases of different sizes are deleted, inserted and replaced at ratios of 1% to 50%. The artificial data information, such as the modification ratio, b.p. size, number of pixels and compression ratio, is shown in Tables 3 and 4. 'A1-0' denotes artificial data of 1M size with 0% modification, i.e. simply extracted from the Human sequence and not modified, whereas 'A10-25' denotes artificial data of 10M size with 25% modification. This modification rate is expressed as 'M' (M.Rate) in Tables 3 and 4 and refers to the ratio of modified B.P. in the original sequence.
For verifcation of the similarity comparison model, this ratewas set higher gradually as the experiment was repeated. ‘Ratio’ refers to the compression ratio of the number of B.P. and pixels of the actual sequence to be converted to a random plot.Forexample,in theTable3, since A1-1 se­quence has 1000.02K bases, and random plot size consists of 36K pixel, the compression ratio is 3.58%. ‘Sim’ means that the similarity result of origin sequence and modifed sequence and ‘Comp.t’ represents the comparison time. Table 3: Specifcation of artifcial data of 1M, 2M size ex­tracted from Human chromosome1and comparison result Sq N. M (%) Length (K bp) Plot (K px) Ratio (%) Sim. (%) Cmp.t (s) A1-0 0 1000.02 36.00 3.58 100.00 0 A1-1 1 999.93 35.79 3.58 99.59 0 A1-2 2 1000.01 36.17 3.62 99.45 0 A1-5 5 999.89 36.67 3.67 98.23 0 A1-8 8 999.97 37.74 3.77 96.06 0 A1-10 10 1000.49 38.05 3.80 91.73 0 A1-15 15 999.78 40.74 4.07 93.58 0.016 A1-20 20 1000.29 42.49 4.25 91.76 0 A1-25 25 999.92 44.2 4.42 86.14 0 A1-30 30 999.79 47.18 4.72 84.23 0.015 A1-40 40 1001.12 50.86 5.08 69.86 0.015 A1-50 50 999.47 58.36 5.84 63.53 0.016 A2-0 0 2000.04 67.09 3.35 100.00 0 A2-1 1 1999.96 66.89 3.34 98.03 0 A2-2 2 2000.15 67.27 3.36 95.85 0 A2-5 5 2000.26 68.99 3.45 94.65 0 A2-8 8 2000.2 70.4 3.52 90.5 0 A2-10 10 2000.14 69.64 3.48 91.2 0.016 A2-15 15 1999.94 70.84 3.54 85.71 0 A2-20 20 2000.18 77.56 3.88 83.62 0 A2-25 25 2000.66 79.97 4.00 72 0 A2-30 30 1999.85 89.15 4.46 73.37 0 A2-40 40 2001.5 88.54 4.42 63.34 0.016 A2-50 50 2000.62 104.11 5.20 54.91 0.016 Tables5and6are data for searching forLTR sequences that are frequently handled in real bioinformatics analysis. In the table 5, R-F-1 is the reference sequence and means chromosome1sequenceofthe Flatfsh.Inthe correspond­ing table 6, Q-F-1 is the query sequence of R-F-1 and is theLTR sequenceextracted from R-F-1. The biggest dif­ference from the artifcially generated data is that theLTR sequence is too short and thus has a low compression rate Table 4: Specifcation of artifcial data of 4M, 10M size Sq N. M (%) Length (K bp) Plot (K px) Ratio (%) Sim. (%) Cmp.t (s) A4-0 0 4000.09 42.62 1.07 100.00 0 A4-1 1 4000.18 42.69 1.07 99.3 0 A4-2 2 3999.71 42.15 1.05 98.93 0 A4-5 5 3999.51 44.13 1.10 98.18 0 A4-8 8 3999.36 44.08 1.10 96.03 0 A4-10 10 4000.1 45.95 1.15 96.27 0 A4-15 15 3999.75 45.69 1.14 94.63 0 A4-20 20 4000.23 49.33 1.23 91.33 0 A4-25 25 3999.7 49.78 1.24 90.93 0 A4-30 30 4001.21 53.79 1.34 84.36 0.016 A4-40 40 3999.59 57.16 1.43 76.82 0.015 A4-50 50 4000.14 64.1 1.60 66.87 0 A10-0 0 10000.05 65.26 0.65 100.00 0 A10-1 1 10000.03 65 0.65 98.08 0 A10-2 2 10000.13 64.81 0.65 97.29 0 A10-5 5 9999.47 66.32 0.66 96.76 0.015 A10-8 8 9999.74 68.75 0.69 95.12 0 A10-10 10 10000.71 67.93 0.68 94.9 0.015 A10-15 15 9999.97 75.13 0.75 91.18 0 A10-20 20 9998.82 74.38 0.74 90.24 0 A10-25 25 9999.4 78.34 0.78 87.68 0.016 A10-30 30 9999.24 82.29 0.82 82.49 0 A10-40 40 9999.82 87.51 0.88 78.48 0 A10-50 50 10001.48 94.45 0.94 66.47 0 in the visualized space. This is because visualization is possible in a limited space without compression. Since the reference sequences are based on the compression ratio of the query sequence, we can see that the random plot size of the reference sequence is very large relatively. Table 5: Specifcation of biological data for reference Sq N. Chr. 
Species Length (M bp) Plot (M px) Ratio (%) R-F-1 R-F-2 R-F-3 R-F-5 R-H-1 1 2 3 5 1 Flatfsh Flatfsh Flatfsh Flatfsh Human 19.80 20.14 22.24 23.64 246.89 19.02 19.34 21.36 22.69 236.44 95.06 96.02 96.04 95.98 95.77 Table 6: Specifcation of biological data for query Sq N. Chr. Species Length (K bp) Plot (K px) Ratio (%) Q-F-1 Q-F-2 Q-F-3 Q-F-5 Q-H-1 1 2 3 5 1 LTR 5’LTR Gypsy LTR HERV-K 0.41 1.56 4.84 8.55 9.26 0.41 1.54 4.78 6.44 8.06 100.00 98.72 98.76 75.32 87.04 Figure 11: Red random plot represents one part of Human chromosome 1, the length of which is4 MB, in terms of nucleotide bases. Green random plot represents the 30% distorted sequence of the red one, Human chromosome 1. 4.2 Experiment:Comparison Between Modifcation ratio and Similarity based proposed Model Table3andFigure12showthe resultof similarity analysis of origin extracted sequence and modifed sequences. In Table 3, ‘Sim’ means that the similarity result of origin se­quence and modifed sequence and ‘Comp.t’ represents the comparison time. As the modifcation ratio increases, the degree of similarity decreases. Thus, it can be confrmed that the similarity comparison model proposed in this study accurately refects the similarity of the sequences. In addi­tion, except for sequence generation, the time required for comparison is 0.02 seconds, which means that it can be processed at a very high speed. 4.3 Experiment:Artifcial Sequence Search over whole genome sequence Table7isthe resultof sequence searching processforex­tracted original sequence from Human chromosome1and themodifed sequences. ‘UnitB.P.’isthesizeofB.P.as aunitof search,‘UnitVector’ referstothesizeofthevec-tor to consider when comparing a time. ‘Error Dist.’ is the distance between the actual sequence position and the re­sult of search position. ‘Find.t’ shows the amount of time spent on search. The original sequence (0% modifed se­quence) search, as well as about the modifed sequence of up to 20% are also searched in a short time. The differ­ence between the actual position and the search result is relatively accurate, as the query size is less than 200 B.P. when the query size is 1M, and only about 2000 B.P. when the query is 10M. Figure 13 and 14 are the visualization Table 7: The result of sequence search for origin sequence and modifed sequencein Human chromosome1 Q Unit sz. Vec.sz error Sim. Find.t sq. (bp) (px) Dist. (%) (s) A1-0 28 11200 0 99.29 17.269 A1-5 27 10800 150 97.27 21.341 A1-10 26 10400 840 91.34 23.213 A1-20 23 9200 120 88.75 22.514 A4-0 92 36800 1160 92.81 6.537 A4-5 90 36000 160 98.41 6.896 A4-10 88 35200 1040 92.68 7.678 A4-20 80 32000 1040 86.3 9.132 A10-0 154 61600 1120 93.88 13.665 A10-5 150 60000 560 97.21 16.065 A10-10 148 59200 280 95.09 14.245 A10-20 134 53600 2020 81.95 22.241 result of search for the query sequence of 1MB, 10MB in the chromosome1of the Human. Red random plotisa vi­sualization of Human chromosome 1, and blue point is the location where the query was searched. Through the visu­alization results, we can see that a query of 1MB size was found at a relatively early stage of the reference sequence, and a query of 10MB size was at the end of the sequence. This is consistent with the position in the actual sequence, and represents a search result in a more intuitive. Figure 13: Searching result of query sequence (A1-0) in reference sequence (Human chromosome 1). Red plot rep­resents reference sequence and blue cross point represents the position of searched query sequence. 
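The biological searches that follow rely on the slot division of Section 3.5, because the references in Table 5 are several orders of magnitude longer than the queries in Table 6. The sketch below evaluates Eq. (3) and derives overlapping slot boundaries; the indexing and helper names are ours, and only the formula and the overlap-by-one-query-length rule come from the text. With c0 = 200 and the plot sizes of R-F-1 (about 19.02 M px) and Q-F-1 (about 0.41 K px), Eq. (3) gives roughly 232 slots.

```python
def slot_count(ref_px, query_px, c0=200.0):
    """Number of reference slots, Eq. (3)."""
    return (ref_px - c0 * query_px) / (query_px * (c0 - 1))

def slot_bounds(ref_px, query_px, c0=200.0):
    """Slot boundaries over the reference vector. Each slot overlaps its
    neighbours by one query length so a query lying on a slot border is
    not missed (Section 3.5); the exact indexing is an assumption."""
    n = max(1, round(slot_count(ref_px, query_px, c0)))
    width = ref_px // n
    return [(max(0, i * width - query_px),
             min(ref_px, (i + 1) * width + query_px)) for i in range(n)]

print(round(slot_count(19_020_000, 410)))   # R-F-1 vs. Q-F-1 -> about 232 slots
```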
4.4 Experiment: Biological Sequence Search over whole genome sequence

Table 8 shows the results of searching for a biological query sequence in a whole genome sequence. The search for the LTR sequence (Q-F-1) extracted from the flatfish chromosome 1 found a match with a similarity of 85.7%, within 90 B.P. of the actual query position, in about 0.4 seconds of search time. On the other hand, the HERV-K sequence (Q-H-1) extracted from Human chromosome 1 took relatively longer, more than 40 seconds, because the query sequence is short and the reference sequence is long. The difference between the actual position and the search result is about 2000 B.P., which is relatively accurate considering that the reference sequence is more than 200M long.

Figures 15, 16, 17 and 18 visualize the flatfish chromosome 1, 2, 3 and 5 sequences, respectively. The red plot is a visualization of the whole genome of the flatfish, and the area marked in blue is where each query was found. Figures 17 and 18 show that the marked positions are almost identical to the origin, reflecting that the Q-F-3 and Q-F-5 queries are actually located within 0.5% of the flatfish whole genome sequence. On the other hand, Figures 15 and 16 show that the marked positions are relatively far away from the origin, since the Q-F-1 and Q-F-2 queries are actually located within 7% and 10% of the flatfish whole genome sequence, respectively. Figure 19 visualizes the Human chromosome 1 sequence and marks the result of searching for the Q-H-1 query; it reflects well that the Q-H-1 query is actually located at about 63% (about 155 M B.P.) of the Human sequence. Figure 20 shows the original query sequence (Q-H-1) and an enlarged subsequence of the reference sequence (R-H-1) at the searched position. The similarity of the sequence found in the reference (green plot) was 78%, and it can be confirmed that it is very similar to the query when matched against the query sequence.

Table 8: The result of sequence search for biological query sequences in flatfish and Human chromosome 1.
Q sq. / Unit sz. (bp) / Vec. sz (px) / Error Dist. / Sim. (%) / Find.t (s)
Q-F-1 1 413 90 85.70 0.400
Q-F-2 1 1540 180 72.40 1.030
Q-F-3 1 4780 960 69.10 0.452
Q-F-5 1 6443 1230 75.20 2.038
Q-H-1 1 8063 2130 78.40 41.011

5 Conclusion

Most genome sequence analyses proceed through comparative analysis by finding similar sequence data. There is therefore a need for a technique to quickly compare and search large amounts of sequence data. The alignment technique is a very accurate way to compare sequences, but its high time and space complexity makes it inadequate for handling large sequences. To overcome these disadvantages, we propose a new method for comparing and searching mega-size sequences. The genome sequence is converted into a random plot in three-dimensional space, and the sequence comparison problem is then replaced with a geometric object comparison problem. The experiments show that the similarity produced by our comparison model accurately reflects the modification ratio between a modified sequence and the original sequence. Most analytical studies based on visualization derive only a single result, because they derive a numerical value from the final result of the visualization. The search and comparison method based on sequence visualization proposed in this study has a high value in terms of information utilization, because all compressed partial

Figure 15: Searching result of query sequence (Q-F-1) in reference sequence (R-F-1).
Red plot represents reference sequence and blue cross point represents the position of searched query sequence. visualization information is used for searching sequence. It is useful in that the partial similarity of the sequence can be measured.In addition,aquery sequenceofsize1-10Mwas searched in a whole genome sequence of 200M or more, and a relatively precise position was found for the original sequenceaswellasthe modifed sequenceupto20%. Also the search time 25 seconds or less, was confrmed handled in a very improved speed compared to the alignment algo­rithm. Onthe other hand, whena sequence witha shorter kilo­byte unit length is used as a query, such as an LTR se­quence, the compression rate is lowered at the time of vi­sualization, resulting in a lower compression rate of the reference sequence, which leads to a longer search time. However, considering the length of the reference, we can confrm that the position searched is relatively accurate. The proposed alignment-free searching method is very fast and effective to fnd a long query sequence over the whole genomes whose size is more than multi-hundreds mega-bytes. It was able to compare and search the se­quence at a much improved rate than the alignment-based model by modifying the sequence data into a three-dimensional random plot object and comparing the similar­ity with the compressed information. Searching algorithm based on alignment method is popular and works good bi­ological sequence comparisonbut if the size of query and target reference is very large (more than 100 mega bases) the alignment base algorithm requires huge memory space and takes a long computation time. Though our algorithm can’tlocates the position of query sequence exactly by the DNAbase unit,but we can use this procedure as one pre­processing step to fnd query sequence. Acknowledgement This research was supported by the Collaborative Genome ProgramoftheKorea Instituteof Marine ScienceandTech­nology Promotion (KIMST) funded by the Ministry of Oceans and Fisheries (MOF) (No. 20140428). References [1] StephenF Altschul,Warren Gish,Webb Miller, Eu­gene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990. https://doi.org/10. 1016/S0022-2836(05)80360-2. [2] Susana Vinga and Jonas Almeida. Alignment-free sequence comparison—a review. Bioinformatics, 19(4):513–523, 2003. https://doi.org/10. 1093/bioinformatics/btg005. [3] Tetsuya Nakamura, Keishi Taki, Hiroki Nomiya, Kazuhiro Seki, andKuniaki Uehara. Ashape-based similarity measure for time series data with ensem­ble learning. Pattern Analysis and Applications, 16(4):535–548, 2013. https://doi.org/10. 1007/s10044-011-0262-6. [4] Milan Randi´cko, Jure Zupan, and Mar­ c, Marjan Vraˇjana Novic.ˇCompact 2-d graphical representa­tion of dna. Chemical physics letters, 373(5):558– 562, 2003. https://doi.org/10.1016/ S0009-2614(03)00639-0. [5] Yongfan Li, Guohua Huang, Bo Liao, and Zanbo Liu. H-l curve: a novel 2d graphical representation of protein sequences. MATCH-COMMUNICATIONS IN MATHEMATICAL AND IN COMPUTER CHEM­ISTRY, 61(2):519–532, 2009. https://doi. org/10.1016/j.cplett.2008.07.046. [6] Milan Randi´c. Graphical representations of dna as 2-d map. Chemical Physics Letters, 386(4):468– 471, 2004. https://doi.org/10.1016/j. cplett.2004.01.088. [7] Yonghui Wu, Alan Wee-Chung Liew, Hong Yan, and Mengsu Yang. Db-curve: a novel 2d method of dna sequence visualization and repre­sentation. Chemical Physics Letters, 367(1):170– 176, 2003. 
https://doi.org/10.1016/ S0009-2614(02)01684-6. [8] Milan Randi´c, Nella Lerš, Dejan Plavši´c, SubhashC Basak, and Alexandru T Balaban. Four-color map representation of dna or rna sequences and their nu­merical characterization. Chemical physics letters, 407(1):205–208, 2005. https://doi.org/10. 1016/j.cplett.2005.03.086. [9] Martin Krzywinski, Jacqueline Schein, Inanc Birol, Joseph Connors, Randy Gascoyne, Doug Horsman, StevenJJones, and MarcoAMarra. Circos: an infor­mation aesthetic for comparative genomics. Genome research, 19(9):1639–1645, 2009. https://doi. org/10.1101/gr.092759.109. [10] Jiyuan An, John Lai, Atul Sajjanhar, Jyotsna Ba-tra, Chenwei Wang, and Colleen C Nelson. J-circos: an interactive circos plotter. Bioinformatics, 31(9):1463–1465, 2015. https://doi.org/10. 1161/CIRCULATIONAHA.115.015220. [11] Bo Liao andKequan Ding.A3d graphical represen­tation of dna sequences and its application. Theoreti­ cal Computer Science, 358(1):56–64, 2006. https: //doi.org/10.1016/j.tcs.2005.12.012. [12] Alexey Pasechnik, Aleksandr Mylläri, Tapio Salakoski,AMylläri,TSalakoski, andTSalakoski. Dynamical visualization of the dna sequence and its nucleotide content. Proceedings of KRBIO, 5:47–50, 2005. [13] Min-Ah Kim, Eun-Jeong Lee, Hwan-Gue Cho, and Kie-Jung Park. A visualization technique for dna walk plot using k-convexhull. Journal of WSCG,5(1­3):212–221, 1997. [14] Daegeon Kwon. Whole genome data visualization and analysis using 3d random walk plot. Master’s thesis, Pusan National University, 2015. [15] Lee Da-Young, KimKyung-Rim, KimTaeyong, and Cho Hwan-Gue. Comparison-specialized visualiza­tion model for whole genome sequences. Journal of WSCG, 24(2):43–52, 2016. [16] Hwan-gue Cho Dayoung Lee, Daegeon Kwon. Web-GL basedVisualization System for Whole Genomes. In Proceedings of KIISE, pages 1414–1416.KOREA INFORMATION SCIENCE SOCIETY, 2016. [17] Milan Randi´c-Topi´ c, Jure Zupan, DraženViki´c, and Dejan Plavšic.´A novel unexpected use of a graphi­cal representation of dna: Graphical alignment of dna sequences. Chemical Physics Letters, 431(4):375– 379, 2006. https://doi.org/10.1016/j. cplett.2006.09.044. https://doi.org/10.31449/inf.v42i3.1855 Informatica 42 (2018) 369–373 369 Cancelable Fingerprint Features Using Chaff Points Encapsulation Mokhled S. Al-Tarawneh Computer Engineering Department, Faculty of Engineering, Mutah University, B.O.Box (7), Mutah 61710, Jordan E-mail: mokhled@mutah.edu.jo Keywords: fingerprint, feature extraction, minutiae, cancelable, encapsulation, chaff points Received: September 20, 2017 Recently, biometrics imaging is widely used in several security areas such as security monitoring, database access, border control and immigration, and for reliable personal verification, identification and recognition schemes. To determine or confirm the identity of an individual's based on their physiological and/or behavioral characteristics, biometric features must be used. The aim of this paper is to review cancelable biometric generation and protection schemes. An approach for generating chaff points for fingerprint template features encapsulation as fingerprint cancelability infrastructure has been presented. Results show that strong positive correlation of original minutiae scores go with high decapsulated minutiae scores. To test the given cancelable approach performance two indexes are used, FAR (false accept rate) and FRR (false reject rate). Povzetek: Razvita je nova biometrična metoda računalniškem algoritmu. 
1 Introduction

Biometrics increasingly forms the basis of identification and recognition across many sensitive applications [1]. Biometrics is the statistical analysis of people's physical and behavioral characteristics; it is more convenient for users, reduces fraud and is more secure. The fingerprint is a commonly used modality compared to traditional identification and verification methods such as plastic identification cards or traditional passwords [2]. Fingerprint authentication has two phases, enrolment and authentication (or verification). Enrolment involves measuring an individual's biometric data to construct a template for storage. Authentication involves a measurement of the same data and a comparison with the stored template [3]. The core of any biometric system is the extracted template, since the matcher algorithms in these systems depend on template matching in one-to-one (verification) and one-to-many (identification) modes. It has become critical to protect fingerprint templates across the widespread biometric community. One way of doing this is to use cancelable techniques, which transform the original templates in a non-invertible way and use the transformed templates to verify a person's identity. Securing a stored fingerprint template and image is of paramount importance because a compromised fingerprint cannot be easily revoked. This is why the fingerprint template should be protected; an ideal biometric template protection scheme should possess the following four properties [2]. 1) Diversity: if a revoked template is replaced by a new one, it should not correspond with the former. This property ensures privacy. 2) Revocability: it should be possible to revoke a compromised template and replace it with a new one based on the same biometric data. 3) Security: it must be computationally hard to obtain the original template from the protected template. This property prevents an adversary from creating a physical spoof of the biometric trait from a stolen template. 4) Performance: the biometric template protection scheme should not degrade the recognition performance, false acceptance rate (FAR) and false rejection rate (FRR) of the biometric system [4]. Biometrics suffers from a particular vulnerability: unlike a password or token, a biometric cannot be revoked, so once it is leaked and the threat of forgery has occurred, the user can no longer use that biometric securely. The only remedy is to replace the template with another biometric feature, but a person has only a limited number of biometric features [5]. In order to overcome the vulnerabilities of biometric systems, both the biometrics and the crypto research communities have addressed some of these challenges; one of the resulting ideas is cancelable biometrics, which has gained a lot of interest in recent years [6]. The concept behind cancelable biometrics, or cancelability, is the transformation of biometric data or extracted features into an alternative form which cannot easily be used by an impostor or intruder and which can be revoked if compromised. This paper proposes a cancelability method based on chaff point encapsulation to cope with these drawbacks. The method was tested according to performance evaluation factors.
2 Related works Cancelable biometric generation has gained a lot of interest in recent years, and it is studied from different point of views, it could be categorized as: 1-Biometric Crypto Systems, this approach is used key binding or key generation schemes, where key binding is a user specific key or a helper data which is independent to the biometric data, while key generation is generating the helper data from the biometric data using specific notations of crypto systems[7] [8] [9] [10]. 2-Biometric Transformations: This approach is based on the transformations of biometric features, where it is categorized into two ways: Bio-Hashing which is used an external key source (PIN or Password) and other functional parameter representation to generate Hash value of the biometric data, it stores the Hash value alone in the data base [11] [12] [13] and Non-invertible transformation [14] [15], such that no information can be revealed from the cancelable biometrics template, which is stored in databases for personal identification/verification, or using biometric data to transform its cancellable domain by polynomial functions and co-occurrence matrices[16]. The proposed method will use encapsulation techniques to protect biometric template. Thus, cancelable template can be attained by template chaff point’s encapsulation, where the principal objectives of cancellable biometrics templates can be checked, such as diversity, cancelability, reusability, non-invertability, and performance of technique. 3 Fingerprint feature extraction The information carrying features in a fingerprint are the line structures, called ridges and valleys[17]. Figure 1, the ridges are black and the valleys are white. It is possible to identify two levels of detail in a fingerprint. Based on carried ridge and valleys minutiae points could be extracted. The minutiae provide the details of the ridge-valley structures, like ridge-endings and bifurcations. Minutiae are subject to post-processing to verify the validity of that are extracted using standard minutiae extraction algorithms. In this study the needed information to be extracted are minutiae coordination’s (x, y), type of minutiae (ridge ending or bifurcation), and orientation. Table 1, shows some extracted samples from FVC2004, DB1_B database. Figure 1: Minutiae-based Fingerprint Extraction. Due to the importance of extracted fingerprint features (minutiae) and it is criticality as a major step in designing a secure biometric system. The protection of feature templates of the users those are stored either in a central database or on smart cards. If it is compromised, it leads to serious security and privacy threats, it is not possible for a legitimate user to revoke his biometric identifiers and switch to another set of uncompromised identifiers, that why we were looking for a technique to protect this extracted temples, encapsulation technique could solve previous problems. A FVC2002 database[18] with best extraction algorithm based on high scores on distributions, acceptance and rejection rates was chosen to be based for cancelable encapsulation algorithm. For accurate algorithms in extracting minutiae features for creating encapsulation cancelable based system, a comparison result of performance evaluation according to values of False acceptance rate (FAR), False rejection rate (FRR) and Error equal rate (EER) was explored, Table 1, all comparison algorithms took coordination, type and orientation as parameters for extracted features. 
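Since the extraction algorithm was selected on the basis of FAR, FRR and EER, a brief reminder of how these rates are obtained from match scores may help. The sketch below is a generic computation over genuine and impostor similarity scores, not the specific evaluation protocol used with the FVC databases; the score convention (similarities normalized to [0, 1]) is an assumption.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor comparisons accepted; FRR: fraction of
    genuine comparisons rejected, at a given decision threshold.
    Scores are assumed to be similarities (higher = better match)."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor, steps=1000):
    """Approximate EER: the operating point where FAR and FRR meet
    (thresholds scanned over [0, 1], assuming normalized scores)."""
    best_gap, eer = 1.0, 0.0
    for i in range(steps + 1):
        far, frr = far_frr(genuine, impostor, i / steps)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

print(far_frr([0.92, 0.81, 0.40], [0.33, 0.55, 0.21], threshold=0.5))  # toy scores
```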
FVC2002,DB1_1,101_1 FVC2002,DB1_1,107_1 X Y Type Orient X Y Type Orient 216 46 3 0.5030 254 38 1 0.7503 190 49 1 3.5827 218 58 3 0.5566 146 64 1 3.2684 160 68 3 2.6141 247 80 1 0.7002 187 74 1 6.1710 173 86 1 0.3665 155 79 1 5.5414 302 93 1 0.8371 162 87 1 2.3393 176 127 3 0.2761 140 130 1 4.9955 227 131 3 0.5634 107 138 3 5.1562 164 135 1 3.3159 156 139 1 1.8174 117 140 1 5.7642 245 139 1 1.0242 196 181 1 3.7386 195 140 3 0.4140 176 187 3 0.4612 195 151 3 3.7130 151 195 1 5.7175 225 156 3 0.9941 285 215 3 0.7886 196 165 1 4.5301 227 218 1 0.8160 151 186 1 4.7262 152 219 1 2.2884 295 188 3 0.9923 169 233 1 4.1407 135 200 3 4.7436 147 242 1 4.6064 241 218 1 4.1310 186 250 1 4.1676 287 239 3 0.7131 Table 1: Extracted minutiae points with data(x, y, t, .). That why minutiae extraction points with previous references (x, y, t, .) was taken as a fundamental step for proposed framework and future method of cancelability. 4 Proposed framework A novel method is proposed in this section. It is name as encapsulation protection method. It includes the building blocks of phases such as preprocessing, minutiae extraction, post processing and cancelable and irrevocable template generation. The proposed method uses fingerprint biometric to generate cancelable template. The system level design of the proposed method is given in figure 2. Figure 2: System level design for fingerprint cancelable template generation. In preprocessing stage a feeding input is the original fingerprint image taken from database DB1_1 [18], where automatic cropping technique was applied based on image background to detect the region of interest (ROI) of target image. ROI image was given to enhancement step as a part of pre-processing stage because the quality of fingerprint structure (ridge, valley) is an important characteristic. An enhancement technique applied in pre-processing phases as normalization, ridge segmentation, structure orientation estimation, frequency enhancement estimation and thinning to get binarization image which is pre extracting feature identification figure 3. After binarization and thinning process, a Cross Number algorithm (CN) described in [19, 20] was applied to get minutiae extraction. The CN algorithm is working on pixel representation to detect all minutiae, while the false minutiae can be eliminated at the post-processing stage by validating algorithm to get only genuine features. Figure 3: A result of proposed frame work, original, enhanced, normalized, filtered and binarized images. Cancelable feature generation The basic idea of cancelable feature generation as encapsulation method is to compute encapsulation chaff points (ECP) based on original extracted minutiae, where it used to recover the enrolled template on transmission stage, as well matching on the same stage. Pseudo-code of ECP is given in Algorithm 1: Algorithm 1: Encapsulation method based on cancelable feature generation. Input Extracted minutiae template with (x,y) coordinates, T-type of minutiae {3 bifurcation, 1 ridge ending}, .-orientation, m number of minutiae, .(x, y, T). Step 1: Perform chaff points For k=1: m Y=change X(x › y, y› x, T=T+1) End for Step 2: Mix new chaff point with original minutiae Z=(X, Y) concatenate Output Z(x,y,T) End Algorithm Informatica 42 (2018) 369–373 371 A representation of original extracted minutiae for FVC2002, DB1_1,101_1 from table1 shown in figure 4. 300 250 200 150 100 50 0 Figure 4: Original extracted minutiae representation. 
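A minimal sketch of Algorithm 1, applied to minutiae rows from Table 1, is shown below. The chaff rule follows the pseudocode (swap x and y, shift the type so that 2 and 4 mark fakes); whether the orientation θ is altered is not stated, so it is left unchanged here, and the decapsulation step, described in the next subsection, is sketched simply as filtering out the fake types.

```python
def encapsulate(minutiae):
    """Algorithm 1 (sketch): for every genuine minutia (x, y, t, theta),
    generate a chaff point by swapping x and y and shifting the type
    (1 -> 2, 3 -> 4), then concatenate genuine and chaff points."""
    chaff = [(y, x, t + 1, theta) for (x, y, t, theta) in minutiae]
    return minutiae + chaff

def decapsulate(template):
    """Decapsulation (sketch): keep only genuine types 1 and 3 and
    discard the fake types 2 and 4."""
    return [m for m in template if m[2] in (1, 3)]

genuine = [(216, 46, 3, 0.5030), (190, 49, 1, 3.5827)]   # rows from Table 1
protected = encapsulate(genuine)
assert decapsulate(protected) == genuine
print(protected)
```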
Applying this algorithm on FVC2002, DB1_1,101_1 from table 1 will give figure 5. a chaff points, while a mixing encapsulation result will be shown in figure 6. 350 300 250 200 150 100 Gen Enc 50 0 0 100 200 300 400 Figure 6: Mixing encapsulation of original minutiae and chaff points representation. A Decapsulation part of proposed frame work is used to open up transmitted encapsulated data, separate faked chaff points from original minutiae points. The following algorithm explaining the procedure of computing decasulation chaff points (DCP), Pseudo-code of DCP is given in Algorithm 2: Algorithm 2: Decapsulation method to wrap up genuine minutiae points. Input Encapsulated template with (x, y) coordinates, T-type of minutiae {3 bifurcation, 1 ridge ending, 2 and 4 fakes}, .-orientation, mnumber of minutiae, .(x, y, T). Step 1: Read transmitted encapsulated template X= Find fake chaff points Step 2: Y= divide a template on base of chaff point with their types Z=(X, Y) separate Output Z(x,y,T) End Algorithm A representation of this process is shown in figure 7. 6 Experimental study An empirical study is performed to test the cancelability and irrevocability of the proposed method using linear correlation test of general original clear minutiae with decapsulated minutiae scores, the strength and nature of the linear relationship between two scores of clear and decapsulated minutiae. Applying linear coefficient (R) formula on given results, the value of R is found to be 0.9999. This is a strong positive correlation, which means that high original minutiae scores go with high decapsulated minutiae scores (and vice versa) figure 8. Another test was done to check the performance of proposed method; it was evaluated by calculating false acceptance rate (FAR) as well false reject rate (FRR) for scenario, original extracted and decapsulated templates. Sequence of experiments is made on the proposed method using benchmark databases such as FVC (Fingerprint Verification Contest) in 2002, 2004 figure 9, figure 10. 7 Conclusion An approach for generating chaff points for fingerprint template features encapsulation as fingerprint cancelability infrastructure has been presented. The approach takes advantage of fingerprint extracted information (minutiae points) to provide a novel way of generating chaffs from original ones. In addition this approach provides encouraging prospects to be used as platform of cancelable fingerprint feature extraction. From all the results, it could be able to prove that this approach with the usage of general extracted minutiae based new chaff points gave a better performance results and it is experienced as an efficient method for irrevocability and cancelablity of fingerprint template encapsulation. References [1] Punithavathi, P. and G. Subbiah, Can cancellable biometrics preserve privacy. Biometric Technology Today, 2007. 7: p. 8-11. https://doi.org/10.1016/S0969-4765(17)30138-8 [2] Jain, A., K. Nandakumar, and A. Nagar, Biometric Template Security. EURASIP Journal on Advances in Signal Processing, 2008: p. 1-17. https://doi.org/10.1155/2008/579416 [3] Ang, R., R. Safavi–Naini, and L. McAven, Cancelable key-based fingerprint templates, in Proc of the Australasian Conf. on Information Security and PrivacyACISP’05,242-252 2005. https://doi.org/10.1007/11506157_21 [4] Moujahdi, C., et al., Spiral Cube for Biometric Template Protection, in Image and Signal Processing. 2012. p. 235-244. https://doi.org/10.1007/978-3-642-31254-0_27 [5] Hirata, S. and K. 
Takahashi, Cancelable Biometrics with Perfect Secrecy for Correlation-Based Matching, ICB 2009,LNCS, Tistarelli, M and Nixon, M.S. (Eds), Springer, 2009. 5558: p. 868­878. https://doi.org/10.1007/978-3-642-01793-3_88 [6] Patel, V.M., N.K. Ratha, and R. Chellappa, Cancelable Biometrics: A Review. IEEE Signal Processing Magazine, 2015. 32(5): p. 54-65. https://doi.org/10.1109/MSP.2015.2434151 [7] Reiter, M., et al. Cryptographic key-generation from voice. in IEEE Computer Society Symposium on Research in Security and Privacy. 2001. USA. https://doi.org/10.1109/SECPRI.2001.924299 [8] Uludag, U., et al., Biometric cryptosystems: issues and challenges. Proceedings of the IEEE, 2004. 92(6): p. 948-960. https://doi.org/10.1109/JPROC.2004.827372 [9] F.Hao, R. Anderson, and J. Daugman. Combining crypto with biometrics effectively. in IEEE Transactions on Computers 2006. https://doi.org/10.1109/TC.2006.138 [10] Ratha, N.K., et al., Generating Cancelable Fingerprint Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007. 29(4): p. 561-572 https://doi.org/10.1109/TPAMI.2007.1004 [11] Viellhauer, C., R. Steinmetz, and A. Mayyerhofer. Biometric hash based on statistical features of online signatures. in the International conference on Pattern Recognition. 2002. https://doi.org/10.1109/ICPR.2002.1044628 [12] Goh, A. and D.C.L. Ngo, Computation of Cryptographic Keys from Face Biometrics, in Communications and Multimedia Security. Advanced Techniques for Network and Data Protection: 7th IFIP-TC6 TC11 International Conference, CMS 2003, , A. Lioy and D. Mazzocchi, Editors. 2003, Springer Berlin Heidelberg: Berlin, Heidelberg. p. 1-13. https://doi.org/10.1007/978-3-540-45184-6_1 Informatica 42 (2018) 369–373 373 [13] R.Ang, R.Safav-Naini, and L.McAven. Cancelable Key-based Fingerprint Templates. in 10th Australian Conf, Information Security and Privacy. 2005. https://doi.org/10.1007/11506157_21 [14] Ratha, N.K., et al., Generating Cancelable Fingerprint Templates. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2007. 29: p. 561-572. https://doi.org/10.1109/TPAMI.2007.1004 [15] Nagar, A. and A.K. Jain. On the security of non-invertible fingerprint template transforms. in 2009 First IEEE International Workshop on Information Forensics and Security (WIFS). 2009. https://doi.org/10.1109/WIFS.2009.5386477 [16] Dabbah, M.A., W.L. Woo, and S.S. Dlay. Secure Authentication for Face Recognition. in IEEE Symposium on Computational Intelligence in Image and Signal Processing. 2007. USA. https://doi.org/10.1109/CIISP.2007.369304 [17] Palmer, L.R., et al. Efficient fingerprint feature extraction: Algorithm and performance evaluation. in 2008 6th International Symposium on Communication Systems, Networks and Digital Signal Processing. 2008. https://doi.org/10.1109/CSNDSP.2008.4610735 [18] FVC2002, F.w.s. [cited; Available from: http://bias.csr.unibo.it/fvc2002]. [19] Zhao, F. and X. Tang. Preprocessing for skeleton-based fingerprint minutiae extraction. in Proc. Int'l Conf Imaging Science, Systems, and Technology (CISST). 2002. https://doi.org/10.1016/j.patcog.2006.09.008 [20] Sudiro, S.A., M. Paindavoine, and M. Kusuma. Simple Fingerprint Minutiae Extraction Algorithm Using Crossing Number On Valley Structure. in IEEE Workshop on Automatic Identification Advanced Technologies, 2007. 2007. 
https://doi.org/10.1109/autoid.2007.380590

https://doi.org/10.31449/inf.v42i3.1559 Informatica 42 (2018) 375–399 375

Using Semantic Perimeters with Ontologies to Evaluate the Semantic Similarity of Scientific Papers

Samia Iltache
Department of Computer Science, Mouloud Mammeri University, Tizi ouzou, Algeria
E-mail: siltache@gmail.com

Catherine Comparot
IRIT, Université de Toulouse, CNRS, INPT, UPS, UT1, UT2J, France
E-mail: Catherine.Comparot@irit.fr

Malik Si Mohammed
Department of Computer Science, Mouloud Mammeri University, Tizi ouzou, Algeria
E-mail: m_si_mohammed@esi.dz

Pierre-Jean Charrel
IRIT, Université de Toulouse, CNRS, INPT, UPS, UT1, UT2J, France
E-mail: Charrel@univ-tlse2.fr

Keywords: domain ontologies, semantic annotation, classification, conceptual graph, semantic perimeter, text similarity

Received: March 17, 2017

The work presented in this paper deals with the use of ontologies to compare scientific texts. It particularly deals with scientific papers, specifically their abstracts, short texts that are relatively well structured and normally provide enough knowledge to allow a community of readers to assess the content of the associated scientific papers. The problem is, therefore, to determine how to assess the semantic proximity/similarity of two papers by examining their respective abstracts. Given that a domain ontology provides a useful way to represent knowledge relative to a given domain, this work considers ontologies relative to scientific domains. Our process begins by defining the relevant domain for an abstract through an automatic classification that makes it possible to associate this abstract to its relevant scientific domain, chosen from several candidate domains. The content of an abstract is represented in the form of a conceptual graph which is enriched to construct its semantic perimeter. As presented below, this notion of semantic perimeter usefully allows us to assess the similarity between the texts by matching their graphs. Detecting plagiarism is the main application field addressed in this paper, among the many possible application fields of our approach.

Povzetek: Prispevek obravnava uporabo ontologij za primerjavo znanstvenih besedil. Poglavitna uporaba je odkrivanje plagiacije.

1 Introduction

Assessing query-text or text-text similarity is the concern of several research domains such as information retrieval and automatic classification of documents. For many works, a document is represented by a vector of words. The very large size of the vectors reduces the effectiveness of these approaches and often requires reducing the number of dimensions used to represent the document vectors. Some approaches are based on a learning corpus to compute the similarity between texts, as is done in the field of document classification. However, a large text corpus may not always be available, and the result of the document classification depends on and varies according to the chosen learning corpus. The similarity is based on the morphological comparison of the terms composing the query and the documents. The polysemy and synonymy inherent in the presence of certain terms of the language, as well as the links between the terms, are ignored, which generates erroneous matching.

In this paper, an approach to assess the similarity between texts is presented, focusing on the similarity of scientific abstracts. This approach is based on a semantic classification of documents using domain ontologies, which provides a more stable base than a learning corpus. A document is no longer represented by a set of characteristics independent of each other, but by a conceptual graph extracted from the ontology to which the document is attached. The similarity between two documents is evaluated by comparing their respective graphs. One of our propositions is to refine this process of semantic comparison through a generic structuring of the abstract of a scientific paper into distinct parts whose descriptive roles are different. The global similarity of two abstracts will indeed be different according to whether one compares, for example, the contribution or the context of the paper, both evoked in the abstract.

The proposed process constitutes a solution that can answer many problems requiring semantic comparison, as is the case, for example, in Semantic Information Retrieval. Finally, the relevance of our approach is examined by using it to highlight risks of plagiarism (expressing identical ideas using different terms), or even self-plagiarism (identical results published more than once by their authors, voluntarily using different terms).

In addition to an original process that compares the abstracts of scientific papers based on domain ontologies and combines a classification step with a semantic comparison of conceptual graphs, one of our main contributions is the introduction of the concept of semantic perimeter, which is obtained by an ontology enrichment process. The semantic perimeter plays an important role in semantic comparison, as shown by our results. Our approach also introduces the possibility of structuring scientific abstracts into three distinct parts, generally respected by authors, namely Context, Contribution and Application domain. Finally, this constitutes a complete process for semantic text comparison, starting from domain ontologies and reaching text similarity.

Section 2 of this paper covers work related to our problem. Section 3 describes the different steps of our text classification and comparison process and explains how to perform this process on scientific abstracts. Finally, Section 4 presents the experimental results of our process, followed by a conclusion on the interest of such an approach and its applicability to several domains, such as constituting a documentary collection on a given knowledge domain by gathering relevant papers (which is more powerful than a mere keyword-based approach), or detecting plagiarism, which is our main purpose here.

2 Related work

2.1 Word similarity

Similarity measures are necessary for various applications in natural language processing such as word sense disambiguation [1] and automatic thesauri extraction [2]. They are also used in Web-related tasks such as the automatic annotation of Web pages [3]. Two classes of approaches dealing with word similarity measures can be distinguished. Distributional approaches [4] consider a word based on its context of appearance: words are represented by a vector of the words that co-occur with them. Latent Semantic Indexing [5] is a vectorial approach that exploits co-occurrences between words; it reduces the space of words by grouping co-occurring words in the same dimensions using Singular Value Decomposition. The textual content of Wikipedia [6][7] and neural networks [8][9] are used for distributional word similarity to define the context of a word.
In the second category, the similarity of two words is based on the S. Iltache et al. similarity of their closest senses. For this purpose, a lexical resource is used, such as WordNet and MeSH. The nodes at these resources represent the meaning of the words. Measures that make it possible to calculate the degree of proximity (distance) between two nodes have been defined. Several approaches can be identified for calculating of such distances: Approaches based only on the hierarchical structure of the resource [10][11][12][13]. The measure proposed in [11] is based on edge counting and the measure proposed in [12] is based on the notion of least common super-concept; that is, the common parent of two nodes, the furthest from the root. In [13], the proposed measure takes into account the minimum distance between two nodes to their most specific common parent (cp) and the distance between cp and the root. Some approaches include information other than the hierarchical structure information, such as statistics on nodes or the informative content of nodes. To represent information content value, probabilities based on word occurrences in a given corpus are associated with each concept in the taxonomy [14][15]. Resources, such as Wikipedia [16][17] and Wiktionary [18], are also used in measuring word similarity. 2.2 Text similarity The purpose of calculating text similarity is to identify documents with similar or different content. The different approaches dealing with textual similarity can be classified into three categories: approaches based on vector representation of document content, approaches applying text alignment, and approaches based on a graphical representation of documents and queries. Some approaches relating to each category are cited below. 2.2.1 Vector similarity A text (document or query) is projected into a vector space where each dimension is represented by an indexing term. Each element of a vector consists of a weight associated with an indexing term. This weight represents the importance of a term and is calculated on the basis of TF-IDF [19] or its variants. The vector similarity is computed using several metrics such as the cosine measurement which measures the cosine of the angle formed by the vectors corresponding to the texts. Two texts are similar if their vectors are close in the vector space in which they are represented. -Document retrieval The vector model is proposed by Salton in the SMART system [20]. To retrieve the documents that best meet a user need, a document and a query are represented by a vector. The relevance of a document to a query is measured by a similarity based on the distance between their respective vector. Adaptations of the basic model have been proposed for processing structured documents [21][22]. The Extended Vector Space Model is one of the first adaptations of the vector model proposed by Fox [22]. A document is represented by an extended vector containing different information classes referred to as objective identifiers (denoted by c-type) such as author, title and bibliographic references. The similarity between a document d and a query q is computed by a measure of similarity which is a linear combination of the different sub vector similarities. Conventional Information Retrieval considers documents only based on their textual content. The evolution of the document content towards a structured representation and more precisely towards the XML format raises new issues. 
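As a concrete illustration of the vector model and cosine measure discussed above, the sketch below builds TF-IDF vectors for a toy three-document corpus and compares two of them. It is a simplified, textbook weighting, hedged as such: none of the cited systems necessarily uses exactly this variant, and the toy corpus is ours.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weights (term frequency times log inverse document
    frequency) for each document, kept as sparse dictionaries."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    return [{t: (tf / len(doc)) * math.log(n / df[t])
             for t, tf in Counter(doc).items()} for doc in tokenized]

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(["ontology based text similarity",
                      "vector space model for text retrieval",
                      "ontology driven document classification"])
print(cosine(vecs[0], vecs[2]))   # both mention 'ontology' -> non-zero score
```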
In [23], the author presents a Searching XML documents through xml fragments. A fragment is a text delimited by a structure. The queries are transformed into XML fragments and, for each document, a profile is created. This profile is represented by a vector composed of the pairs (t, c), where c is the context of appearance of the term t. The context is assimilated to the element with its path. An entry in the index is no longer a term but a pair (t, c). Another adaptation of the vector model described in [24] based on the computation of the cosine makes it possible to compute the similarity between a node n, belonging to a tree representing a document, and a query q. In [25], the corpus is represented by a labeled tree where each sub­tree is considered as a logical document. The authors introduce the notion of structural term (s-term) which is a labeled tree. An s-term may be an element, an attribute, or a term. The similarity between a query and a document is computed by the scalar product of the vectors. The weight of the terms is computed during the retrieval phase since the notion of logical tree is defined according to the structure of the query. -Document classification. Automatic texts classification makes it possible to group documents dealing with similar themes around the same class. Supervised classification approaches assign documents to predefined classes [26][27][28] while unsupervised classification approaches automatically define classes, referred to as clusters, [29]. In the supervised classification, classifiers use two document collections: A collection containing training documents to determine the characteristics of each category and a collection containing new documents to be automatically classified. The classification of a new document depends on the characteristics selected for each category. There are various supervised machine learning classification techniques. In [30], the author provides a comparison of their features. The method based on the K Nearest Neighbors (KNN) [28][31] assumes that if the vectorial representations of two documents are close in vector space, they have a strong probability of belonging to the same category. A new document d is compared with documents belonging to the training set. The category assigned to document d depends on the category of its K nearest neighboring documents. To determine the category to be assigned to the document d, the most assigned class to the K neighbors closest to d is chosen or a weight is assigned to the different classes of k nearest neighbors according to the classification of these neighbors. Thus the class with the highest weight will be retained. Informatica 42 (2018) 375–399 377 With Support Vector Machines (SVM), documents are represented in a vector space by the indexing terms that compose them. Using a training phase, this method defines a separating surface, called hyperplan, between the documents belonging to two classes which maximize the distance between this hyperplan and the nearest documents and minimizes categorization errors [32]. A category c is assigned to a new document d as a function of the position of d relative to the separating surface. Some classifiers create a "prototype" class from the training collection [26]. This class is represented by the mean vector of all the document vectors in the collection. Only some features are retained which constitutes a loss of information. 
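A minimal sketch of the KNN rule just described follows: rank the training documents by similarity to the new document and take the majority class of the k nearest neighbours. The cosine similarity, the toy vectors and the class labels are ours; weighted voting, also mentioned above, would replace the plain majority count.

```python
import math
from collections import Counter

def _cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(doc_vec, training, k=3):
    """Assign the class most frequent among the k training documents
    whose vectors are closest to doc_vec."""
    ranked = sorted(training, key=lambda item: _cosine(doc_vec, item[0]),
                    reverse=True)
    votes = Counter(label for _vec, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [({"ontology": 1.0, "graph": 0.5}, "semantics"),
            ({"fingerprint": 1.0, "minutiae": 0.8}, "biometrics"),
            ({"concept": 0.9, "ontology": 0.7}, "semantics")]
print(knn_classify({"ontology": 1.0, "concept": 0.4}, training, k=2))
```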
Some approaches replace the training collection with data extracted from "world knowledge" such as Open Directory Project (ODP) [33]. Other approaches exploit thesauri or domain ontologies with conventional classifiers (SVM, Naive Bayes, K-means, etc.) and represent a document by a vector whose features are concepts or a set of terms and concepts [29][34][35]. As reported in [36], approaches using the vector representation of documents have several limitations: Their performances decrease as soon as they apply to relatively long texts. With the weighting formulas used, words appearing only once in the document or, on the contrary, words that are often repeated are ignored although they have a meaning with respect to the content of the document. The vector representation as defined does not highlight the relationships between words in a document, thus generating erroneous matching. A document is represented by a vector whose size is equal to the number of features retained to represent the various categories, in the case of classification, and the number of terms used to represent the corpus, in the case of information retrieval. In [37], the authors studied the impact of the number of dimensions on the "nearest neighbor" problem. Their analysis revealed that when this number increases, the distance to the nearest data point approaches the distance to the farthest data point. 2.2.2 Sentence alignment Approaches dealing with sentence alignment are divided into three categories. Syntactic approaches based on morphological word comparison, semantic approaches using sentence structure and approaches that combine syntax and semantics. Gunasinghe [38] proposes a hybrid algorithm that combines syntactic and semantic similarity and uses a vectorial representation of sentences by using WordNet. This algorithm takes into account two types of relationship in the sentence pairs: relationships between verbs and relationships between nouns. Liu [39] proposes an approach to evaluate the semantic similarity between two sentences. They use a regression model, Support Vector Regression, combined with features defined using WordNet, corpus, alignment and other features to cover various aspects of sentences. Other approaches perform the text alignment by comparing all the words preserving their order in sentences. However, these algorithms are rather slow and they do not dissociate terms describing the theme of the document from those used to build sentences. In [40], authors use a text alignment algorithm [41] to align a text with the set of documents in a corpus. This algorithm uses a matrix in which the deletion or insertion of a word is represented by -1, a mismatch by a 0 while a match is represented by its IDF weight. The authors use a full-text alignment where the highest score from any cell in the alignment matrix represents the similarity score of two texts. In [42], authors introduce a new type of sentence similarity called Structural Similarity for informal, social network styled sentences. Their approach eliminates syntactic and grammatical features and performs a disambiguation process without syntactic parsing or POS Tagging. They focus on sentence structures to discover purpose-or emotion-level similarities between sentences. 2.2.3 Graph similarity Assessing of the graph similarity is used, in particular, in the field of Information Retrieval. The document and query are both represented by a conceptual graph constructed from a domain ontology or a thesaurus. 
In the domain of Semantic Information Retrieval, Dudognon [43] represents the documents by a set of "annotations". Each annotation consists of several conceptual graphs. The similarity between two graphs is defined as the weighted average of the similarities between the concepts that compose this graphs and the similarity between two "annotations" is computed by the mean of similarities of their conceptual graphs. Baziz [44] suggests constructing a graph for each document and for each query using concepts extracted from WordNet. A mapping of the graph of a document to that of the query leads the author to represent the two graphs with respect to the same reference graph made up of nodes belonging to the document and to the query. Each graph is then expanded by adding nodes of the reference graph. The weights of the nodes added to the query are zero whereas in the sub-tree of the document where a node is added, the weight of a level s node is updated recursively by multiplying the weight of the level s+ 1 node (the level s node subsumes the level s+ 1 node) by a factor which depends on the hierarchy level. The two representations are then compared using fuzzy operators and a relevance value is computed. This value expresses the extent to which the document covers the subject expressed in the query. Shenoy [45] represents a document by a "sub-ontology" constructed using the demo version of ONTO GEN Ontology Learner which is part of the TAO Project. Two documents are compared by applying the alignment of their "sub-ontology" based on the number of concepts, properties and relationships contained in each document. In [46], the authors propose a unified framework of graph-based text similarity measurement by using Wikipedia as background knowledge. They call each article in Wikipedia a Wikipedia concept. For each document, the authors extract representative keywords or phrases and then map them into Wikipedia concepts. These concepts constitute the nodes at the bottom of the bipartite graph. There is an S. Iltache et al. edge between a document node and a concept node if the concept appears in the specific document. The weight of the edge is determined by the frequency of the concept’s occurrence in that document. The similarity of two documents is determined by the similarity of the concepts they contain. The authors in [18] present a unified graph-based approach for measuring semantic similarity between linguistic items at multiple levels: senses, words, and sentences. The authors construct different semantic networks. One of them is based on WordNet. The nodes in the WordNet semantic network represent individual concepts, while edges denote manually-crafted concept-to-concept relations. This graph is enriched by connecting a sense with all the other senses that appear in its disambiguated gloss. Measuring the semantic similarity of a pair of linguistic items consists of an Alignment-based Disambiguation and a random Walk on a semantic network. In [47], authors propose a graph-based text representation, which is capable of capturing term order, term frequency, term co-occurrence, and term context in documents. A document is represented by a graph. A node represents a concept: a set of single word or phrase and an edge is constructed based on proximity and co-occurrence relationship between concepts. In addition; the associations among concepts are represented through their contexts. The nodes within the window (e.g. paragraph, sentence) are linked by weighted bidirectional edges. 
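To illustrate the graph-matching idea behind these approaches, the sketch below scores two small concept graphs as the average, over the concepts of one graph, of the best-matching concept in the other, in the spirit of the weighted average of concept similarities used in [43]. Everything concrete here is illustrative: the toy taxonomy, the Wu-Palmer-style depth measure (one of the hierarchy-based measures surveyed in Section 2.1) and the best-match pairing are our choices, not those of the cited works.

```python
# Toy is-a taxonomy (child -> parent); purely illustrative.
PARENT = {"cnn": "neural_network", "rnn": "neural_network",
          "neural_network": "classifier", "svm": "classifier",
          "classifier": "method", "method": None}

def ancestors(c):
    """Path from a concept up to the taxonomy root."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path

def concept_sim(a, b):
    """Wu-Palmer-style similarity: 2*depth(lcs) / (depth(a) + depth(b))."""
    pa, pb = ancestors(a), ancestors(b)
    lcs = next(c for c in pa if c in pb)      # least common super-concept
    depth = lambda c: len(ancestors(c))
    return 2 * depth(lcs) / (depth(a) + depth(b))

def graph_sim(ga, gb):
    """Average of the best concept match of each node of ga in gb."""
    return sum(max(concept_sim(a, b) for b in gb) for a in ga) / len(ga)

print(graph_sim({"cnn", "svm"}, {"rnn", "classifier"}))   # about 0.78
```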
The approach described in [48] presents a graph-based method to select the related keywords for short text enrichment. This method exploits topics as background knowledge. The authors extract topics and re-rank the keywords distribution under each topic according to an improved TF-IDF-like score. Then, a topic-keyword graph is constructed to prepare for link analysis. In [49], the authors create a semantic representation of a collection of text documents and propose an algorithm to connect them into a graph. Each node in a graph corresponds to a document and contains a subset of document words. The authors define a feature and document similarity measures based on the distance between the features in the graph. 2.3 Detecting plagiarism Plagiarism consists in copying a work of an author and presenting it as one’s own original work. Plagiarism detection systems usually have the original document and the suspicious document as inputs. They focus on the following points: an exact copy of the text (copy/paste), inserting or deleting words, substituting words (use of synonyms), reformulation and modification of sentences structure. In n-gram approach, a text is characterized by sequences of n consecutive characters [50][51][52]. Based on statistical measures, each document can be described with so called fingerprints, where n-grams are hashed and then selected to be fingerprints [53]. An overlap of two fingerprints extracted from the suspicious and source documents indicates a possibly plagiarized text passage. Statistical methods [54] do not require an understanding of the meaning of the documents. The common approach is to construct the document vector from values describing the document such as the frequency of terms. Comparing the source document with the suspicious document, amounts to calculating their degree of similarity on the basis of different measures (BM25, language model, etc.). Vani [55] segments the source document and the suspicious document into sentences. Each sentence is then represented by a vector of weighted terms that compose it. Each sentence of the source document is compared to all the sentences of the suspicious document and similarity between two vectors is computed using, individually, several metrics (cosine, dice, etc.). Vani studies the importance of the combination of these various metrics on detecting plagiarism. He also explores the impact of the use of POS Tagging on calculating of sentence similarity. The sentences labeled by a syntactic parser are thus compared by matching the terms belonging to the same class (nouns with nouns, verbs with verbs, adjectives with adjectives and adverbs with adverbs). Other approaches based on sentences alignment compute the overlapping percentage of words or sentences between the source document and the suspicious document. These methods do not permit the detection of cases of plagiarism where synonymy is used to replace words in the reformulation of sentences. The representation of a document by a graph is also used in detecting plagiarism. In [45], the alignment of "sub-ontologies" is based on the number of concepts, properties and relations corresponding to the original document and the suspicious document. Alignment is expressed as a fraction of the whole. If this fraction is above a given threshold, the system concludes that the two documents are similar in meaning. Osman [56] describes an approach of detecting plagiarism by representing documents (original and suspicious) with a graph deduced from WordNet. 
This approach is useful in detecting forms of plagiarism where synonymy is used to reformulate sentences. The document is divided into sentences. Each node of the graph constructed for the document represents the terms of a sentence. The terms of the sentences are projected on WordNet to extract the concepts corresponding to them. Each relationship between two nodes is represented by the overlap between the concepts of the two nodes. These concepts help in detecting suspicious parts of a document. An important characteristic of our approach lies in the fact that it is not necessary to have a reference document a priori, since any document can be compared with a corpus dealing with the same knowledge domain, as identified in the first step of the process proposed here.

3 Proposed approach

The representation of a document by a semantic graph is used in different domains such as information retrieval [43][44], plagiarism detection [45][56] and document summarization [57]. However, these graphs differ in the way they are constructed. The purpose of our approach is to assess the semantic similarity between textual documents. Unlike conventional approaches, a document is not represented by a vector. Our approach is to build a conceptual representation of a text in the form of a semantic graph in which the nodes and arcs correspond respectively to concepts and to relationships between concepts extracted from the chosen domain ontology. The similarity between two texts is evaluated in two steps. The first step is to perform a semantic classification of documents based on domain ontologies. The classification makes it possible to deduce an overall similarity defined by the context in which the content of the document is used. The second step compares and evaluates the similarity of two texts related to the same domain ontology by comparing their constructed and enriched graphs, as explained in the following sections.

3.1 Classification of documents

The process is based on a semantic classification of texts using domain ontologies [58]. Figure 1 summarizes the classification process. The classification groups documents according to the knowledge domain covered by their content. This grouping identifies an overall similarity and involves several steps.
- Projection, extraction of terms and candidate concepts. The "projection" of a document on different ontologies helps to associate meaning to the terms of the document with respect to concepts belonging to these ontologies and to select the candidate concepts. The notion of concept gives a meaning to a term relative to the domain in which this concept is defined. The whole document is divided into sentences. Each sentence is browsed from left to right, starting from the first word. The words of each sentence are projected, before pruning stop words, on different domain ontologies in order to extract the longest phrases (groups of adjacent words in a sentence, called "terms") that denote concepts. This choice is motivated by two observations: 1) concepts are often represented by labels consisting of several words (an example of mono- and multi-word concepts is given in Table 1); 2) long terms are less ambiguous and better determine the meaning conveyed by the sentence. Several concepts belonging to the same domain ontology may be candidates for a given term. The following example shows to what extent it is important to bring out the longest terms and the longest concept.
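Before the worked example that follows, a minimal sketch of this greedy longest-match projection is given below. It is an illustration under assumptions, not the authors' code: `concept_labels` stands for the set of labels of the ontology being projected on, and the underscore-joining of words mirrors WordNet-style labels.

```python
def extract_terms(sentence_words, concept_labels, max_len=8):
    """Scan a sentence left to right and keep, at each position, the longest
    group of adjacent words whose label exists in the ontology."""
    terms, i = [], 0
    while i < len(sentence_words):
        found = None
        # try the longest span first so multi-word concepts win over single words
        for j in range(min(len(sentence_words), i + max_len), i, -1):
            label = "_".join(w.lower() for w in sentence_words[i:j])
            if label in concept_labels:
                found = (label, j)
                break
        if found:
            terms.append(found[0])
            i = found[1]
        else:
            i += 1  # no concept starts at this word; move on
    return terms
```

On the sentence discussed next, such a scan would retain secretary_of_state_for_the_home_department rather than the shorter, more ambiguous secretary or state.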
For the sentence: "The Secretary of State for the Home Department had clearly indicated that evidence obtained by torture was inadmissible in any legal proceedings," the synsets in Table 1 are extracted from WordNet. As shown in Table 1, there are several synsets in WordNet that correspond to the words "secretary of state for the home department" in the sentence. These synsets have one or more words.

Figure 1: Classification of a document.

Table 1: Extraction of terms and synsets (words in the sentence: Secretary, of, State, for, the, Home, Department).
Synset label in WordNet | N° synset in WordNet
secretary_of_state_for_the_home_department | 09526473
secretary_of_state | 09883412, 09455599, 00569400
secretary | 09880743, 04007053, 09880504, 09836400
state | 07682724, 00024568, 13192180, 08125703, 07646257, 13656873, 07673557, 08023668
home | 08037383, 13687178, 07587703, 03141215, 03398332, 03399133, 07973910, 07974113, 08060597
department | 07623945, 08027411, 05514261

The longest term, "secretary of state for the home department", is extracted from the sentence. It corresponds to the synset secretary_of_state_for_the_home_department (09526473), which represents the correct sense in the sentence.
- Local disambiguation. In the projection step, for each ontology, all the candidate concepts for a given term are extracted. The local disambiguation process is used to select, for a term t, the most appropriate concept among several candidates belonging to the same ontology. To do this, the context of occurrence of the term t in the document is taken into consideration. The appropriate concept for the term t is chosen by taking into account both the neighboring terms of t (i.e., the terms occurring in its context) and the semantic distance, in the ontology considered, between the concepts associated with t and the concepts corresponding to these neighboring terms. The meaning of a term t in a document is determined by its nearest unambiguous neighboring terms. t will then be disambiguated by its nearest neighbor on the left or by its nearest neighbor on the right. In case the left and right neighbors exist simultaneously, they will both be taken into consideration. The disambiguation process is carried out at three levels, starting at the sentence level. For each sentence, the ambiguous terms are disambiguated considering their left and right neighbors in the sentence. Any disambiguated term helps to move forward in the disambiguation of the next terms. This process is repeated in case ambiguous terms still remain, considering in a second step the paragraph level and finally, if necessary, the document level. The local disambiguation process at the sentence level, summarized by the algorithm in Figure 2, considers the unambiguous neighboring terms surrounding t that have associated concepts in the ontology considered: it retrieves the concepts Cnl and Cnr, corresponding respectively to nl, the nearest neighbor on the left of t, and nr, the nearest neighbor on the right of t. The appropriate concept for the term t among the candidate concepts is the one semantically nearest to Cnl or Cnr. This amounts to browsing the ontology and calculating the minimum distance between each candidate concept associated with t and the concepts Cnl and Cnr. Several existing metrics in the literature can be used to calculate this minimum distance. An example of local disambiguation in the anatomy domain of WordNet is given in Figure 3.

Figure 2: Local disambiguation at the sentence level.
Figure 3: Disambiguation of shoulder and hand.
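A compact sketch of the neighbour-based choice summarised in Figure 2 is given below. It is hedged: `dist` stands for whichever semantic distance is used (for instance the Rita measure mentioned in Section 4.1.1), and the function name and signature are hypothetical.

```python
def disambiguate_term(candidates, left_concept=None, right_concept=None, dist=None):
    """Select, among the candidate concepts of an ambiguous term t, the one
    semantically closest to Cnl and/or Cnr, the concepts of its nearest
    unambiguous neighbours. Returns None when no neighbour is available,
    in which case disambiguation is retried at paragraph, then document level."""
    neighbours = [c for c in (left_concept, right_concept) if c is not None]
    if not neighbours:
        return None
    return min(candidates, key=lambda c: min(dist(c, n) for n in neighbours))
```

With the distances reported in Table 2 below, calling this function for shoulder with candidates 05231159 and 05231380 and the concept of spinal column (05268544) as left neighbour would retain 05231159, the synset with the smaller distance.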
Table 2 shows the terms and their senses (synsets) in the anatomy domain of WordNet. The different calculated distances help in choosing the most appropriate synset for each ambiguous term. The term shoulder in the sentence is ambiguous. To disambiguate it, spinal column, its nearest unambiguous neighbor term on the left, is considered. The synset retained is 05231159. The term hand in the sentence is ambiguous. Its disambiguation is done using shoulder and skeleton, its two nearest unambiguous neighboring terms on the left and on the right. The synset retained is 05246212.

Table 2: Disambiguation of ambiguous terms.
Words in the sentence | Synset label (anatomy) | N° synset | Distance between synsets | Term extracted
Bones | bone | 04966339 | - | bone
Spinal column | spinal_column | 05268544 | - | spinal column
Shoulders (ambiguous) | shoulder | 05231159, 05231380 | Dist(05268544, 05231159) = 0.42857143; Dist(05268544, 05231380) = 0.5 | shoulder
Hands (ambiguous) | hand | 05246212, 02352577 | Dist(05246212, 05231159) = 0.42857143; Dist(02352577, 05231159) = 0.6363636; Dist(05246212, 05265883) = 0.42857143; Dist(02352577, 05265883) = 0.6363636 | hand
Skeleton | skeleton | 05265883 | - | skeleton

At the end of the preceding steps, a document d is represented by several sets of concepts extracted from the domain ontologies on which it was projected. These sets are represented by equation (1).
- Global disambiguation. The classifier must be able to conclude about the relevance of a document relative to a given context and to choose, from the different ontological representations, the one that best corresponds to its context. A score is calculated for each candidate ontology of the document; the highest score determines the ontology selected to represent document d. The different terms in a document, taken together and considering the contextual relations linking them, make it possible to conduct a semantic evaluation of the textual content. A matrix, defined by equation (2), is associated with each ontology and each document. The rows and columns of this matrix represent all the concepts extracted from the considered ontology for the document d. Ci is the concept selected for the term ti after projection of the document d on this ontology, and lCiCj represents the weight of the link between the concept Ci and the concept Cj (i ≠ j). The matrix is initialized to zero. If a term ti and a term tj appear together within the same paragraph of the document d, and the concepts Ci and Cj respectively correspond to the terms ti and tj, then the weight lCiCj = 1. The weight lCiCj is updated whenever the terms ti and tj appear together in the same paragraph; it is updated over all paragraphs of the document d. The weight lCiCi corresponds to the appearance of the term ti in the document and is equal to 1. The importance of the concept Ci in document d is determined by its total weight in d relative to the considered ontology. This weight is given by the row associated with it in the matrix. The score of each ontology, obtained from the sum of the weights of all the concepts extracted from this ontology for the document d, measures the extent to which each ontology represents this document. The ontology that obtains the highest score is selected to represent the document d. For documents belonging to the same knowledge domain, their "local" semantic similarity is computed. The process compares their content using their semantic perimeter – a notion that is introduced and defined later in the paper – constructed on the basis of their conceptual graph extracted from the ontology to which they are attached.
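The paragraph-level link matrix and the per-ontology score described above could be sketched as follows. This is an illustration consistent with the description, not the authors' code; the data layout and helper names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def ontology_score(paragraphs):
    """`paragraphs` is a list of concept lists, one per paragraph of document d,
    for a single candidate ontology. Off-diagonal weights count paragraph-level
    co-occurrences of two concepts; the diagonal marks the presence of a concept."""
    link = defaultdict(float)
    for concepts in paragraphs:
        present = set(concepts)
        for c in present:
            link[(c, c)] = 1.0                 # l_CiCi: the term appears in d
        for ci, cj in combinations(sorted(present), 2):
            link[(ci, cj)] += 1.0              # l_CiCj updated for each paragraph
            link[(cj, ci)] += 1.0
    # score of the ontology = sum of the weights of all its concepts (row sums)
    return sum(link.values())

def select_ontology(projections):
    """`projections` maps each candidate ontology name to its per-paragraph
    concept lists; the ontology with the highest score represents the document."""
    return max(projections, key=lambda name: ontology_score(projections[name]))
```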
3.2 Text similarity and semantic perimeter

An author describes the subject of his document by evoking one or more different notions. He can describe them by addressing several sub-notions. These notions and/or sub-notions can be described in a general or a precise way, according to the level of detail to be highlighted. In an ontology, there exists a structure defining the meaning of the information representing a given knowledge domain and the way in which the pieces of information are related to each other. This structure is defined by several branches representing different hierarchies. Each hierarchy has branches separating data that share some characteristics but also have distinct ones. The tree of Figure 4, inspired by the geometric figures ontology proposed in [59], shows two branches, Br1 (figure) and Br2 (angle), representing two different data. Branch Br2 has two sub-branches, 2.1 and 2.2, corresponding respectively to a right angle and an acute angle. Right angle and acute angle are two concepts with different characteristics, but with common characteristics defined by their common parent, angle.

Figure 4: Extract from the geometric figures ontology.

3.2.1 Objective of the approach

Consider two texts Txt1 and Txt2, previously classified in the same knowledge domain represented by a domain ontology, whose similarity needs to be assessed: Sim(Txt1, Txt2). Our semantic similarity process is based on the following assumptions:
1. Each branch/sub-branch of the ontology is associated with a notion/sub-notion described in a document.
2. Concepts linked by "is-a" relations form a branch.
3. A branch can have several sub-branches.
4. Two branches with the root of the ontology as their only common parent represent two different notions.
5. Two sub-branches having a common parent represent two different sub-notions sharing common characteristics defined by their common parent.
6. The weight of an initial concept is equal to 1.
7. The weight of an added concept, representing implicit information, is less than 1.
8. The similarity between two texts varies between 0 and 1.
Our approach is based on the identification of the branches to which the concepts of the documents belong and on the enrichment of the conceptual graphs of these documents. Associating a notion with a branch helps in identifying different and identical notions. It can be said, for example, that the notion "angle" is different from the notion "figure" and that the notion "triangle" is different from the notion "quadrilateral", because they belong to different branches or sub-branches. The concepts quadrilateral, parallelogram, diamond and square belong to the same sub-branch describing the same notion. Each of them brings a degree of precision, knowing that this precision increases the further one goes down the hierarchy. Graph enrichment highlights notions common to two documents without these being explicitly cited in their content and makes it possible to deduce similarities between notions by examining the branches to which their corresponding concepts belong.

3.2.2 Graph enrichment

To describe a given subject, authors can choose different words and different levels of description depending on the importance that each of them wishes to give to a notion addressed in the text. Thus, by adding concepts, graph enrichment makes it possible to deduce implicit information that can be shared by two texts. Like Baziz [44], our process enriches the text graphs by adding concepts.
The applied enrichment differs from that achieved by Baziz in the choice of the concepts to be added and in the weight assigned to these concepts. In our case, the weight assigned to the concepts helps in defining the implicit or explicit presence of a concept. A graph is enriched by constructing the semantic perimeter of its corresponding text and by comparing it to another graph.

3.2.2.1 Constructing the semantic perimeter of a text

Definition 1: The semantic perimeter of a text is a graph whose nodes are the initial concepts and the link concepts. Initial concepts are extracted from the domain ontology to which the document is attached. These concepts represent the information explicitly described in its content. With these concepts, a conceptual graph is constructed and enriched by link concepts representing the implicit information of the text, deduced from the initial concepts by browsing the "is-a" relationships and the transversal relationships defined in the domain ontology. The semantic perimeter thus constructed for each document makes it possible to evaluate their semantic similarity even if these documents describe the same ideas with different terms.
- Constructing the graph of initial concepts. During the classification process, a text is projected onto a set of domain ontologies. At the end of this step, the text is represented by a conceptual graph whose nodes constitute the initial concepts. These concepts correspond to the terms explicitly cited in the document.
- Constructing the semantic perimeter. The link concepts extracted from the ontology, lying on the shortest path linking the initial concepts Ci and Cj through is-a relations or transversal relations, are added to the graph of a document. Link concepts are selected in order to retain only concepts that make sense in relation to the knowledge domain represented by the ontology. In fact, some concepts represented in an ontology are used to construct the structure of the ontology but have no meaning for the domain in question. Example: host and hard_disk are two synsets extracted for a document classified in the computer_science domain. Figure 5 shows the synsets linking them in WordNet.

Figure 5: Link synsets linking host to hard_disk.

The link synsets are: {computer 02971359, machine 03561924, device 03068033, memory_device 03604997 and magnetic_disk 03568359}. The synsets machine 03561924 and device 03068033 are not retained, since they belong respectively to the buildings domain and to the factotum domain.

3.2.2.2 Comparing graphs

Comparing two texts Txt1 and Txt2 is carried out from their semantic perimeters G1 and G2. A mutual enrichment of these two graphs is achieved by comparing the concepts belonging to G1 with the concepts belonging to G2. Each graph enriches the other and concepts are added to G1 and/or to G2. This is done by browsing the graphs from the leaf nodes to the root as follows (a minimal sketch is given after this list):
• If the graph G1 (the graph G2) contains a concept C1 and the graph G2 (the graph G1) contains a concept C2 such that C2 is an ancestor of C1, then the concept C2 is added to the graph G1 (to the graph G2).
• The graphs are also enriched by adding the common parents of concepts belonging to graphs G1 and G2. This enrichment is done in two steps:
  • by considering concepts belonging only to the graph G1 (to the graph G2);
  • by considering the concepts belonging to both graphs G1 and G2.
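As announced above, here is a minimal sketch of this mutual enrichment. It is hedged: `ancestors(c)` is an assumed helper returning the set of is-a ancestors of a concept, `common_parent(a, b)` an assumed helper returning the closest shared ancestor of two concepts (possibly None), and the code illustrates the two rules rather than reproducing the authors' implementation.

```python
def enrich(g1, g2, ancestors, common_parent):
    """Mutual enrichment of two concept sets G1 and G2.
    Rule 1: an ancestor of a concept of one graph that appears in the other
    graph is added to the first graph.
    Rule 2: common parents are added, first for pairs taken inside each graph,
    then for pairs taken across the two graphs."""
    g1, g2 = set(g1), set(g2)
    for c in list(g1):
        g1 |= ancestors(c) & g2          # rule 1, from G2 towards G1
    for c in list(g2):
        g2 |= ancestors(c) & g1          # rule 1, from G1 towards G2
    for graph in (g1, g2):               # rule 2, within each graph
        for a in list(graph):
            for b in list(graph):
                if a < b:
                    parent = common_parent(a, b)
                    if parent:
                        graph.add(parent)
    for a in list(g1):                   # rule 2, across the two graphs
        for b in list(g2):
            parent = common_parent(a, b)
            if parent:
                g1.add(parent)
                g2.add(parent)
    return g1, g2
```

For instance, with G1 = {square} and G2 = {parallelogram} on the hierarchy of Figure 4, rule 1 adds parallelogram (an ancestor of square cited in G2) to G1, making the shared notion explicit.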
By adding common parent concepts, graph enrichment helps in determining the branches and sub-branches common to G1 and G2 and thus in deducing an implicit similarity between Txt1 and Txt2. As an illustration, in the geometric figures domain represented by Figure 4, three texts (T1, T2 and T3) are considered, whose content is as follows:
T1: A square is a regular polygon with four sides. It has four right angles and its sides have the same measure.
T2: A diamond is a parallelogram. Some diamonds have four equal angles.
T3: A triangle has three sides. If it has a right angle, it is a right triangle.
- Let us compare T1 and T2. The semantic perimeters of T1 and T2 and the comparison of their respective graphs G1 and G2 are given in Figure 6. The projection of the texts T1 and T2 on the ontology represented by Figure 4 allows us to find the initial concepts used to construct the graphs G1 and G2. G1 is represented by the concepts (square, polygon, right angle) and G2 by the concepts (diamond, parallelogram, angle). At this stage, the graphs have no common concept.

Figure 6: Comparison and enrichment of graphs corresponding to T1 and T2.

The enrichment of these two graphs made it possible to add concepts semantically linked to the initial concepts and to bring out concepts common to the two texts that are not explicitly cited in their contents. The common concepts are diamond, parallelogram, quadrilateral, polygon and angle.
- Let us compare T2 and T3. The semantic perimeters of T2 and T3 and the comparison of their respective graphs G2 and G3 are given in Figure 7. The projection of the texts T2 and T3 on the ontology represented by Figure 4 allows us to find the initial concepts used to construct the graphs G2 and G3. G2 is represented by the concepts (diamond, parallelogram, angle) and G3 by the concepts (triangle, right triangle, right angle). The enrichment of the two graphs enabled us to find the common concepts (angle and polygon).

Figure 7: Comparison and enrichment of graphs corresponding to T2 and T3.

3.2.3 Calculating the similarity of two texts

Definition 2: Textual similarity is defined by the set of common notions and sub-notions addressed by two texts. It is a function of the concepts corresponding to these texts, of their weights and of the branches to which these concepts belong. The similarity of two texts is given by the similarity of their respective graphs according to equation (3):

Sim(Txt1, Txt2) = Sim(G1, G2)    (3)

3.2.3.1 Weight of the concepts

The weight attributed to an initial concept is equal to 1. This weight defines the explicit presence of the concept in the document. Concepts belonging to the same branch do not have the same semantic weight: concepts at the top of the hierarchy have a more general meaning than concepts at the bottom of the hierarchy, which carry a more precise meaning. The further one descends towards the bottom of the hierarchy, the more precise the meaning of the concepts. Thus, a concept added to graph G1 during the enrichment process is assigned a weight whose value is less than 1. This weight represents the value of a piece of implicit information and is calculated from a parameter g, which expresses the degree of generalization of a parent concept with respect to its child concept. Like Fuhr [60] and Baziz [44], who reduce the weight of the nodes of a tree representing a document according to their position with respect to the most specific nodes by multiplying it by a factor whose value is between 0 and 1, our process computes the weight of an added concept using the parameter g, whose value is between 0 and 0.1, according to equation (4):

w(Cj) = 1 - (g × Length(Ci, Cj))    (4)

where Cj is the added concept, Ci is the initial concept, belonging to G1 and/or to G2, that is the lowest in the branch to which Cj is added, and Length(Ci, Cj) indicates the number of arcs linking Cj to Ci in the branch.
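Equation (4) is simple enough to state directly in code; the following one-line sketch just reproduces it and, with g = 0.05, recovers the values that appear in Table 3 below (0.95, 0.90, 0.85, 0.80 for one to four arcs).

```python
def added_concept_weight(arc_length, g=0.05):
    """Weight of a concept added during enrichment (equation (4)):
    1 - g * Length(Ci, Cj), where Length is the number of arcs between the
    added concept Cj and the lowest initial concept Ci of its branch."""
    return 1.0 - g * arc_length

# e.g. added_concept_weight(1) is approximately 0.95 and added_concept_weight(4) is approximately 0.80
```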
3.2.3.2 Semantic similarity of two graphs G1 and G2

A factor is introduced indicating the percentage of common notions described by the two texts. Its value is calculated as the number of common branches relative to the total number of branches belonging to the two graphs. The similarity between two graphs G1 and G2 is computed using equation (5), where B represents any branch belonging to the graphs, Bc represents a branch common to both graphs, C is a concept belonging to graphs G1 or G2, and Ccom is a concept common to both graphs. nbBc(G1, G2) and nbB(G1, G2) respectively represent the number of common branches and the total number of branches belonging to the two graphs.

3.2.3.3 Example

Let us again take the examples shown in Figures 6 and 7 and summarize the various results in Tables 3 and 4. For the parameter g, the value 0.05 is used. Initially, G1 and G2 showed no concept in common and, therefore, a priori no similarity. The same applies to graphs G2 and G3. The enrichment of the graphs helped to bring out a similarity between T1 and T2, as well as between T2 and T3, that is not explicitly described in their content. The results also show that text T2 is semantically closer to T1 than to T3.

Table 3: Concepts of T1 and T2 after enriching their respective graphs.
Text | Concept | Type | Weight
T1 | square | initial | 1
T1 | diamond | link | 0.95
T1 | parallelogram | link | 0.90
T1 | quadrilateral | link | 0.85
T1 | polygon | initial | 1
T1 | angle | ancestor | 0.95
T1 | right angle | initial | 1
T2 | diamond | initial | 1
T2 | parallelogram | initial | 1
T2 | quadrilateral | ancestor | 0.85
T2 | polygon | ancestor | 0.80
T2 | angle | initial | 1
Common branches: 1, 1.2, 2. All branches: 1, 1.2, 2, 2.2.

Table 4: Concepts of T2 and T3 after enriching their respective graphs.
Text | Concept | Type | Weight
T2 | diamond | initial | 1
T2 | parallelogram | initial | 1
T2 | polygon | common parent | 0.85
T2 | angle | initial | 1
T3 | right angle | initial | 1
T3 | right triangle | initial | 1
T3 | triangle | initial | 1
T3 | polygon | common parent | 0.85
T3 | angle | ancestor | 0.95
Common branches: 1, 2. All branches: 1, 1.1, 1.2, 2.2, 1.1.3, 2.

3.3 Similarity of scientific abstracts

Refining the process of semantic comparison of two texts (defined in Section 3.2) is performed through a generic structuring of the abstract of a scientific paper into distinct parts whose descriptive roles are different. Several works have taken an interest in the annotation of the discursive structure of scientific papers: text zoning [61][62]. Their objective is to better characterize the content of the papers by defining several classes (objective, method, results, conclusion, etc.), knowing that the existence of these classes depends on the corpus studied. Categorization is performed at the sentence level: for each sentence of an abstract, the authors associate a class chosen from the defined classes. This work deals with decomposing scientific abstracts into zones for the purpose of detecting plagiarism.
From the structure generally reproduced by the authors of scientific papers, the content of a scientific abstract is divided into three distinct parts which are referred to as zones that define the context, the contribution and the application domain. This decomposition is generally reflected in most scientific papers that aim, in principle, at making a scientific contribution in a given domain. This decomposition aims to extract the notions relating to each zone and thus permits a comparison between zones of the same type. The process can then evaluate, in a progressive approach, whether two abstracts deal with the same context, whether their contributions are similar and whether they apply their approach to the same application domain, the risk of plagiarism evidently increasing with each conclusive comparison. Categorization at the sentence level poses a problem when information from one class is cited in another class. In analyzing several abstracts, it was found that there is no strict uniformity in writing abstracts: all the sentences belonging to a given zone do not contain only the terms describing this zone but may contain terms representing another zone. For example, a sentence assigned to the application domain zone may contain terms defining an algorithm or a method (terms that instead define the contribution zone). This overlapping of several zones in the same sentence then generates labeling errors. To illustrate the categorization at the sentence level, each sentence of abstract 2 provided in section (3.3.1), is associated with one of the three selected zones. "Recently, new approaches have integrated the use of data mining techniques in the ontology enrichment process. Indeed, the two fields, data mining and ontological meta-data are extremely linked: on one hand data mining techniques help in the construction of the semantic Web, and on the other hand the semantic Web assists in the extraction of new knowledge. Thus, many works use ontologies as a guide for the extraction of rules or patterns, allow to discriminate the data by their semantic value and thus to extract more relevant knowledge. It turns out, however, that few works aimed at updating the ontology are concerned with data mining techniques. In this paper, we present an approach to support the onologies management of websites based on the use of Web Usage Mining techniques. The presented approach has been tested and evaluated on an website ontology , which we have constructed and then enriched based on the sequential patterns extracted on the log. " The following inconsistencies are noted: -The term sequential pattern is assigned to the Application domain zone while it represents the algorithm and method used by the author and, therefore, defines the contribution. -The term Data mining technique is assigned to the context zone while it represents the contribution. -The term ontologies management is assigned to the contribution zone while it defines the context. To evaluate the semantic similarity of the two abstracts given in section (3.3.1), their content was previously divided as illustrated above. For each abstract, three graphs are constructed and enriched (a graph for each selected zone). For each zone, a similarity value is calculated. The similarity values obtained are very low. This is justified by assigning the terms to a zone while they semantically define another zone, a consequence of the decomposition based on categorization at the sentence level and of the overlapping of zones. 
To overcome this problem of overlapping zones, the terms are assigned to each zone of an abstract according to the overall meaning conveyed by its content. From the global meaning of an abstract, the meaning and the role of its terms are deduced. A term can describe the context of the paper (document categorization, document clustering, image categorization, ontology enrichment, information retrieval, etc.), the contribution (the methods and algorithms, as well as the notions used to describe them) or the application domain (classification applied to a given corpus, data mining applied to textual documents, data mining applied to the web, data mining applied to images, etc.). In addition, the terms contained in the title and in the keywords are used, as they often carry information that is not cited in the abstract. The role of each term is defined according to the knowledge domain in which it is used. Semantic annotation of concepts has notably been carried out in WordNet Domains [63]. In WordNet Domains, different subject fields are defined, such as medicine, computer science and architecture. Each synset of WordNet [64] is annotated by one or more subject fields in which this synset has a meaning. On the basis of the principle that a term describes one of the three zones selected to characterize a scientific abstract, each concept of the ontology associated with this abstract is annotated with one of the three zones (context, contribution and application domain). The extraction of the concepts corresponding to each zone is performed by projecting the terms composing the content of an abstract onto the ontology. Comparing two abstracts amounts to comparing the zones playing the same role. Three partial similarities are then calculated on the basis of the concepts belonging to the same zone; two abstracts are thus compared at three levels. A global similarity of two scientific abstracts A1 and A2 is obtained by combining the three partial similarities according to equation (6):

Sim(A1, A2) = α × Simcontext(A1, A2) + β × Simcontribution(A1, A2) + γ × Simapplicationdomain(A1, A2)    (6)

where α, β and γ are parameters whose values are between 0 and 1; they define the importance attributed to the context, the contribution and the application domain, with α + β + γ = 1. The global similarity makes it possible to rank abstracts in descending order of their similarity, as illustrated in Tables 10, 11 and 12. The documents processed are not necessarily suspicious, since it is possible to use this approach to compare a document under review, for example, with an entire corpus, without any a priori as to its respect for scientific ethics. A similarity threshold, determined by experimentation and according to the ontology and the collection of abstracts used, determines whether a risk of plagiarism exists. Abstracts with high similarity will then require a full review of the entire document.
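Equation (6) is a straightforward weighted combination. The sketch below simply restates it and, as a sanity check, recombines the partial similarities of the example in Section 3.3.2 into the reported global value; it is not the authors' code.

```python
def global_similarity(sim_context, sim_contribution, sim_domain,
                      alpha=0.35, beta=0.63, gamma=0.02):
    """Global similarity of two abstracts (equation (6)); alpha, beta and gamma
    weight the context, contribution and application-domain zones and sum to 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * sim_context + beta * sim_contribution + gamma * sim_domain

# With the partial similarities of Table 6 (0.98, 0.59, 0.10) this yields about 0.72.
```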
3.3.1 Example

Figure 8 provides an extract of an ontology associated with the ontology enrichment domain and shows the annotation of the concepts by the three zones defined to characterize the content of a scientific abstract. Let us consider two abstracts from two scientific papers. These papers, published in French, were translated for the needs of our work. The construction of their graphs and the calculation of their partial and global similarities are given in Section 3.3.2.

Abstract1: Ontology enrichment based on sequential pattern. The mass of information now available via the web, in constant evolution, requires structuring in order to facilitate access and knowledge management. In the context of the Semantic Web, ontologies aim at improving the exploitation of informational resources, positioning themselves as a model of representation. However, the relevance of the information they contain requires regular updating, and in particular the addition of new knowledge. In this paper, we propose an ontologies enrichment approach based on data mining techniques and more specifically on the search for sequential patterns in textual documents. The presented approach has been tested and evaluated on an ontology of the water domain, which we have enriched from documents extracted from the Web. Key words: ontology, enrichment, semantic web, data mining, sequential pattern.

Figure 8: Extract of the ontologies enrichment domain ontology, and annotation of concepts by their zone.

Abstract2: Web usage mining for ontology enrichment. Recently, new approaches have integrated the use of data mining techniques in the ontologies enrichment process. Indeed, the two fields, data mining and ontological meta-data, are extremely linked: on one hand data mining techniques help in the construction of the semantic Web, and on the other hand the semantic Web assists in the extraction of new knowledge. Thus, many works use ontologies as a guide for the extraction of rules or patterns, allow to discriminate the data by their semantic value and thus to extract more relevant knowledge. It turns out, however, that few works aimed at updating the ontology are concerned with data mining techniques. In this paper, we present an approach to support the ontologies management of websites based on the use of Web Usage Mining techniques. The presented approach has been tested and evaluated on a website ontology, which we have constructed and then enriched based on the sequential patterns extracted on the log. Key words: Semantic Web, ontology, Web Usage Mining, enrichment, data mining, sequential pattern.

3.3.2 Applying our approach

3.3.2.1 Extracting the initial concepts for each abstract

Initial concepts are extracted at the classification step. The two abstracts are attached to the ontology represented in Figure 8. The concepts are assigned to their appropriate zone according to their annotation.

Figure 9: Enriched graph of Abstract1.

Table 5: Distribution by zone of the concepts of Abstract1 and Abstract2.
Zone | Concepts of Abstract1 (type) | Concepts of Abstract2 (type)
Context | Ontology_management (added) | Ontology_management (initial)
Context | Ontology_enrichment (initial) | Ontology_enrichment (initial)
Context | Ontology (initial) | Ontology (initial)
Contribution | Data_mining (initial) | Data_mining (initial)
Contribution | Technique (added) | Technique (added)
Contribution | Data_mining_technique (initial) | Data_mining_technique (initial)
Contribution | Sequential_pattern (initial) | Sequential_pattern (initial)
Contribution | - | Web_usage_mining (initial)
Application domain | Informational_resource (initial) | Informational_resource (added)
Application domain | Textual_document (initial) | log (initial)
Application domain | Domain (added) | Domain (added)
Application domain | Water_domain (initial) | Website (initial)

3.3.2.2 Enrichment of the graphs corresponding to the two abstracts

The initial concepts are used to enrich the graphs of the two abstracts by constructing their semantic perimeters and by comparing their graphs. The enriched graphs of the two abstracts, Abstract1 and Abstract2, are represented in Figure 9 and Figure 10. The distribution by zone of the initial concepts and of the concepts added by enrichment is given in Table 5.

3.3.2.3 Similarity calculation between Abstract1 and Abstract2

Table 6 provides the values of the global similarity and of the partial similarities.
(Values obtained with α = 0.35, β = 0.63, γ = 0.02, g = 0.05.)

Table 6: Similarities between Abstract1 and Abstract2.
Simcontext(Abstract1, Abstract2) | 0.98
Simcontribution(Abstract1, Abstract2) | 0.59
Simapplicationdomain(Abstract1, Abstract2) | 0.10
Sim(Abstract1, Abstract2) | 0.72

3.3.2.4 Result

The results obtained indicate that these two abstracts deal with the same context (sim context = 0.98) with similar approaches. The similarity obtained for the contribution is high (sim contribution = 0.59). These two abstracts differ at the application domain level, since the similarity value obtained for this zone is very low (sim application domain = 0.10). The global similarity obtained is high. This value indicates that the papers associated with these two abstracts should be the subject of a more in-depth analysis that could possibly reveal a case of plagiarism.

4 Experimentations

Our approach is evaluated at two levels. The first evaluation concerns our semantic classification process based on domain ontologies (CBO) and the second concerns the textual similarity calculation process for scientific abstracts.

4.1 Semantic classification process

4.1.1 The data

The implementation of our semantic classification process was performed using WordNet and WordNet Domains simultaneously. In WordNet Domains, several knowledge domains are defined; these domains were assimilated to domain ontologies. The Rita similarity measure [13] was used to measure the semantic distance between two synsets in WordNet. The terms within sentences were annotated with their type (noun, verb, adverb and adjective) by the Stanford Part-Of-Speech Tagger (POS Tagger) [65]. To evaluate conventional classifiers on our corpus, a pre-processing step was performed on the documents. The nouns, verbs and adjectives used in each document were retained. The lemmas corresponding to these terms were extracted and their weights, based on Tf-Idf, were then calculated. These lemmas constitute the vector representation of the documents. For the conventional classifiers, the implementations of three algorithms of Weka [66] were used: SVM, Naive Bayes and the C4.5 decision tree. Our evaluation covers 10 domains defined in WordNet Domains and a corpus consisting of 976 abstracts of scientific papers. Some abstracts of the medicine domain were extracted from the MuchMore corpus, a parallel corpus of English-German scientific medical abstracts obtained from the Springer Link web site. All the other abstracts of our corpus were extracted from several scientific journals specialized in the retained domains by browsing their web sites. Table 7 gives the distribution of the abstracts over the selected domains.

Table 7: Distribution of abstracts by domain.
Domain | Number of abstracts
Music | 106
Law | 83
Computer_science | 101
Politics | 76
Physics | 101
Chemistry | 83
Economy | 104
Buildings | 104
Medicine | 117
Mathematics | 101
Total | 976

4.1.2 Results and discussion

The measures traditionally used in categorization are considered in this work: precision, recall, F-measure and baseline accuracy. The results of our process were compared with those of conventional classifiers; they are summarized in Table 8. The recall (Rc) is the number of documents that are correctly classified in a class divided by the total number of documents belonging to that class. The precision (Pr) is the number of documents that are correctly classified in a class divided by the number of documents assigned to that class. A measure that combines precision and recall is their harmonic mean, referred to as the F-measure (F). Baseline accuracy (Acc) gives the percentage of documents correctly classified relative to the total number of documents in the corpus.
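For completeness, the four measures just defined can be computed directly from the predicted and true labels. This is a generic sketch, not tied to the Weka output used in the experiments.

```python
def per_class_metrics(true_labels, predicted_labels, target):
    """Precision, recall and F-measure of one class, plus overall accuracy."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return precision, recall, f_measure, accuracy
```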
Table 8: Comparison of the results of the various classifiers (each cell gives Pr, Rc, F).
Classes | CBO | Naive Bayes | SVM | Tree C4.5
Music | 0.962, 0.943, 0.952 | 0.835, 0.906, 0.869 | 0.963, 0.981, 0.972 | 0.913, 0.887, 0.900
Law | 0.952, 0.964, 0.958 | 0.777, 0.880, 0.825 | 0.947, 0.867, 0.906 | 0.766, 0.711, 0.737
Computer_science | 0.970, 0.950, 0.960 | 0.845, 0.861, 0.853 | 0.872, 0.941, 0.905 | 0.474, 0.644, 0.546
Politics | 0.949, 0.974, 0.961 | 0.788, 0.829, 0.808 | 0.944, 0.882, 0.912 | 0.754, 0.645, 0.695
Physics | 0.960, 0.960, 0.960 | 0.833, 0.842, 0.837 | 0.887, 0.931, 0.908 | 0.513, 0.386, 0.441
Chemistry | 0.940, 0.952, 0.946 | 0.947, 0.867, 0.906 | 0.986, 0.880, 0.930 | 0.848, 0.807, 0.827
Economy | 0.980, 0.962, 0.971 | 0.820, 0.788, 0.804 | 0.855, 0.904, 0.879 | 0.541, 0.442, 0.487
Buildings | 0.980, 0.962, 0.971 | 0.950, 0.913, 0.931 | 0.925, 0.952, 0.938 | 0.757, 0.750, 0.754
Medicine | 1.000, 0.983, 0.991 | 0.982, 0.940, 0.961 | 0.991, 0.991, 0.991 | 0.894, 0.863, 0.878
Mathematics | 0.925, 0.980, 0.952 | 0.904, 0.842, 0.872 | 0.898, 0.871, 0.884 | 0.493, 0.673, 0.569
Average | 0.964, 0.963, 0.963 | 0.872, 0.869, 0.870 | 0.926, 0.924, 0.924 | 0.694, 0.682, 0.683
Accuracy | 0.963 | 0.869 | 0.924 | 0.682

To calculate these values for SVM, Naive Bayes and tree C4.5, cross-validation was performed and the results obtained with the best parameters were retained. Table 8 shows that, for our process, the values of recall and precision are close, and close to 1. This is an indicator of the good performance of our classifier. Considering the averages of the precisions, recalls and F-measures, our process obtains better results than the three conventional classifiers considered. The best percentage of documents correctly classified relative to all the documents in the corpus is also obtained by our semantic classification process. A Wilcoxon Signed-Rank test was used in order to study the statistical significance of the improvement brought about by our process. The p-value between our system and each of the three conventional classifiers was calculated. This Wilcoxon Signed-Rank test is based on the values of the F-measure obtained for CBO, SVM, Naive Bayes and tree C4.5. The improvement is considered statistically significant if the p-value is below 0.05 and very significant if it is below 0.01. The results of the test are summarized in Table 9.

Table 9: Wilcoxon test results.
| CBO vs SVM | CBO vs Naive Bayes | CBO vs Tree C4.5
P-value (F-measure) | 0.00885858 | 0.00294464 | 0.000976562

The p-values obtained with the Wilcoxon test are all less than 0.01. These are very significant p-values. This allows us to conclude that our system significantly improves the classification of documents compared to conventional classifiers at the threshold α = 0.01. The three conventional classifiers have in common the representation of the documents by words independent of each other, as well as a morphological comparison of the words belonging to the documents. Their comparison is performed at the word level, whereas in our process the comparison is performed at the level of the overall context of the document. A document is represented by the domain described in its content. This domain is deduced from the words of the document taken together, considering their relationships in the context in which they appear. In addition, our process is built from domain ontologies, which constitute a more stable basis than a training collection. Indeed, a modification in the choice of the documents constituting this training collection leads to a modification of the results of conventional classifiers.
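The Wilcoxon comparison of Table 9 can be approximated from the per-class F-measures of Table 8 using SciPy. This is a hedged sketch: the exact p-value depends on how ties and zero differences are handled, so it may differ slightly from the value reported in Table 9.

```python
from scipy.stats import wilcoxon

# Per-class F-measures taken from Table 8 (CBO and SVM columns).
f_cbo = [0.952, 0.958, 0.960, 0.961, 0.960, 0.946, 0.971, 0.971, 0.991, 0.952]
f_svm = [0.972, 0.906, 0.905, 0.912, 0.908, 0.930, 0.879, 0.938, 0.991, 0.884]

# Paired, two-sided Wilcoxon signed-rank test; p < 0.01 is read as very significant.
statistic, p_value = wilcoxon(f_cbo, f_svm)
print(statistic, p_value)
```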
4.2 Semantic similarity process for scientific abstracts

4.2.1 The data

Our implementation was extended by adding processes to build the semantic perimeters, to divide scientific abstracts into three zones and to compare graphs. To evaluate our approach defining the semantic similarity of scientific abstracts, we constructed an ontology representing the domain of automatic classification of documents. To construct our corpus, a set of scientific abstracts related to this domain was extracted from the web. In the different tests, the abstract, the title of the paper and the keywords were taken into account. Each abstract was compared with all the abstracts in the corpus; the abstracts were compared in pairs. For example, the results presented here were obtained by comparing twenty abstracts, for which 190 comparisons were made. The construction of the initial graph and of the semantic perimeter of each abstract, and the comparison of the graphs, are done according to the process defined in the previous sections. Each concept of our ontology was annotated with one of the three zones selected to characterize the content of the scientific abstracts: context, contribution and application domain. This annotation is performed according to the role that each concept plays in the chosen domain. For example, the clustering, classification and document concepts are annotated with the context zone; the concepts representing the different algorithms and methods used by the authors, as well as all the concepts describing these methods, are annotated with the contribution zone; and the concepts representing the type of document (Text, Web) and the corpus used are annotated with the application domain zone. Our approach was compared to two existing approaches. The first is based on a vector representation of the content of the text: Bag-of-words. The process of extracting terms is similar to the one performed in Section 4.1.1. An abstract vector contains the lemmas corresponding to the nouns, verbs and adjectives extracted from the text. Lemmas are represented by their Tf-Idf weights. The similarity of two abstracts is calculated by measuring the cosine of the angle between their respective vectors. The second approach, n-grams, is based on the representation of an abstract by a set of character sequences called n-grams. The text is divided into a set of n-grams, the size of an n-gram being determined by a chosen number of consecutive characters, n. Several values of n were tested (n = 2, 4 and 8) and, for each, the similarity between two abstracts was calculated using equations (7) [51][52] and (8) [53]. For any pair of abstracts x and y, the similarity Sim(x, y) is computed according to equations (7) and (8), where w denotes an arbitrary n-gram, fx(w) denotes the relative frequency with which w appears in the abstract x, Dn(x) represents the so-called n-gram dictionary of x, and |·| is the number of n-grams. The best results were obtained with n = 8 and equation (8), for which the fewest erroneous matchings were noted.
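The exact formulations of equations (7) and (8) follow [51][52][53] and are not reproduced here. Purely as an illustration of the dictionary-based family to which (8) belongs, one common instantiation is the overlap of the two n-gram dictionaries; the normalisation used in the cited work may differ.

```python
def ngram_dictionary(text, n=8):
    """Set of character n-grams (the n-gram dictionary Dn) of a text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(x, y, n=8):
    """Illustrative dictionary overlap |Dn(x) & Dn(y)| / |Dn(x) | Dn(y)|;
    equation (8) in the paper may normalise differently."""
    dx, dy = ngram_dictionary(x, n), ngram_dictionary(y, n)
    return len(dx & dy) / len(dx | dy) if (dx or dy) else 0.0
```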
4.2.2 Results and discussion

The parameter values (α, β and γ) depend on the ontology and on the corpus used. Several values for these parameters were tested. The goal of this study is to attribute more importance to the context zone and to the contribution zone, since it aims to find matches that primarily indicate documents dealing with the same context and presenting similar contributions. The following values were retained: α = 0.35, β = 0.63, γ = 0.02, g = 0.05. These values led to the abstracts being grouped by context. Table 10 and Table 11 provide the results obtained when comparing, respectively, the abstracts A1 and A12 with the other abstracts. These tables provide the three partial similarities computed for each pair of abstracts as well as their global similarity.

Table 10: Similarities between A1 and the other abstracts (Text1 = A1.clustering for every row).
Text2 | Context | Contribution | Application domain | Global
A3.clustering | 1.000 | 0.401 | 1.000 | 0.622
A10.clustering | 1.000 | 0.295 | 0.157 | 0.539
A2.clustering | 0.982 | 0.306 | 0.065 | 0.538
A9.clustering | 1.000 | 0.227 | 0.153 | 0.496
A16.clustering | 1.000 | 0.169 | 0.065 | 0.458
A15.clustering | 1.000 | 0.103 | 0.237 | 0.419
A17.clustering | 1.000 | 0.092 | 0.345 | 0.415
A5.clustering | 1.000 | 0.095 | 0.237 | 0.414
A18.clustering | 1.000 | 0.022 | 0.353 | 0.371
A19.classif-clust | 0.558 | 0.016 | 0.065 | 0.207
A14.classification | 0.244 | 0.125 | 0.431 | 0.172
A6.classification | 0.240 | 0.074 | 0.016 | 0.131
A8.classification | 0.225 | 0.060 | 0.541 | 0.127
A7.classification | 0.225 | 0.060 | 0.065 | 0.118
A4.classification | 0.244 | 0.036 | 0.065 | 0.109
A11.classification | 0.237 | 0.034 | 0.108 | 0.107
A13.classification | 0.230 | 0.014 | 0.065 | 0.090
A12.classification | 0.231 | 0.007 | 0.125 | 0.088
A20.classification | 0.237 | 0.005 | 0.031 | 0.087

Table 11: Similarities between A12 and the other abstracts (Text1 = A12.classification for every row).
Text2 | Context | Contribution | Application domain | Global
A13.classification | 1.000 | 0.015 | 0.000 | 0.360
A4.classification | 0.966 | 0.032 | 0.000 | 0.358
A20.classification | 0.964 | 0.012 | 0.483 | 0.355
A6.classification | 0.965 | 0.015 | 0.193 | 0.351
A14.classification | 0.966 | 0.012 | 0.066 | 0.347
A11.classification | 0.964 | 0.007 | 0.023 | 0.342
A8.classification | 0.900 | 0.005 | 0.185 | 0.322
A7.classification | 0.900 | 0.005 | 0.000 | 0.318
A19.classif-clust | 0.541 | 0.107 | 0.000 | 0.257
A5.clustering | 0.234 | 0.027 | 0.329 | 0.105
A3.clustering | 0.234 | 0.032 | 0.125 | 0.105
A18.clustering | 0.234 | 0.026 | 0.125 | 0.101
A17.clustering | 0.234 | 0.024 | 0.123 | 0.100
A9.clustering | 0.227 | 0.019 | 0.189 | 0.095
A15.clustering | 0.231 | 0.004 | 0.329 | 0.090
A1.clustering | 0.231 | 0.007 | 0.125 | 0.088
A2.clustering | 0.233 | 0.005 | 0.000 | 0.085
A16.clustering | 0.231 | 0.006 | 0.000 | 0.085
A10.clustering | 0.227 | 0.004 | 0.032 | 0.083

The results, ranked in descending order of global similarity, show a grouping of the abstracts by context. Abstract A1 deals with the document clustering context; the abstracts that have the highest similarity with A1 correspond to this context. Abstract A12 deals with the document classification context; the abstracts that have the highest similarity with A12 also correspond to this context. Table 10 provides a comparison of the similarities between A1 and the other abstracts at three levels: their similarity can be compared at the context level, at the contribution level and at the application domain level.
The values obtained comparing A1 with A3 indicate that these two abstracts deal with the same context (sim context = 1), present similar contributions (Sim contribution = 0, 401) and apply their approach to the same domain (sim application domain = 1). The value of their global similarity is high. These values enable us to retain these two abstracts as suspicious documents, thus requiring further reading and analysis of their entire contents. Table 11 provides a comparison of the similarities between A12 and the other abstracts at three levels. For the last ten rows of Table 11, very low partial and global similarities were obtained. The first eight rows of Table 11 show that the corresponding abstracts deal with the same context as abstract A12 (sim context >= 0.900) but use different approaches (sim contribution <= 0.032). Their global similarity is low (<= 0,360). This enables us to conclude that abstract A12 does not present any risk of plagiarism with the other abstracts. The goal of our approach is to be able to find suspicious documents; that is, documents with high similarities. To find these documents, a threshold for the calculated similarity values is determined by experimentation. To compare the results obtained with our approach to those of Bag-of-words and n-grams, similarities between the different abstracts of our corpus using the Bag-of-words and n-grams approaches were calculated. The abstracts were then ranked in descending order of their similarity. For these two approaches, several erroneous matching were found. Table 12, gives an example of the comparison of the similarities between A4 and the other abstracts obtained by our approach, and the Bag-of-words and n-grams approaches. A4 deals with the context classification. With Bag-of-word and n-grams approaches, most of the abstracts semantically closest to A4 deal with the clustering context. For the Bag-of-words approach, abstracts belonging to the context clustering (A10, A3, A2, A5, A15, A1) obtain a better similarity score than those (A11, A8, A12, A20, A7, A14) that deal with the same context that A4. It is the same for the n-grams approach. 
With the n-grams approach, abstracts belonging to the clustering context (A18, A3, A10, A1) obtain a better similarity score than those (A7, A13, A14, A20) that deal with the same context as A4. For all the comparisons made between the abstracts in the corpus, our approach is able to correctly rank the abstracts by context, as shown in Tables 10, 11, 12 and 13. Clustering and classification are two different contexts. For these two contexts, the methods and algorithms used are different.

Table 12: Similarities between A4 and the other abstracts using our approach, Bag-of-words and n-grams (Text1 = A4.classification for every row; each column gives the Text2 abstract and its similarity score, ranked per approach).
Our approach | Bag-of-words | N-grams
A6.classification 0.417272 | A06.classification 0.125685 | A11.classification 0.042080
A11.classification 0.401363 | A10.clustering 0.108323 | A18.clustering 0.038287
A13.classification 0.373563 | A13.classification 0.097182 | A03.clustering 0.036313
A12.classification 0.358287 | A19.classif-clust 0.095763 | A06.classification 0.035757
A14.classification 0.358132 | A03.clustering 0.092988 | A10.clustering 0.035634
A7.classification 0.353878 | A02.clustering 0.092751 | A08.classification 0.035602
A20.classification 0.353120 | A05.clustering 0.089178 | A12.classification 0.034261
A8.classification 0.330633 | A15.clustering 0.073636 | A01.clustering 0.033475
A19.classif-clust 0.257688 | A01.clustering 0.066826 | A19.classif-clust 0.033400
A5.clustering 0.191517 | A11.classification 0.061259 | A07.classification 0.033071
A3.clustering 0.180843 | A08.classification 0.045829 | A17.clustering 0.032417
A9.clustering 0.176679 | A18.clustering 0.043951 | A09.clustering 0.029097
A2.clustering 0.175801 | A12.classification 0.042752 | A15.clustering 0.026786
A15.clustering 0.147094 | A16.clustering 0.041947 | A05.clustering 0.025901
A10.clustering 0.135412 | A20.classification 0.033817 | A13.classification 0.025269
A18.clustering 0.129238 | A07.classification 0.031982 | A14.classification 0.023015
A17.clustering 0.119075 | A17.clustering 0.028876 | A02.clustering 0.020426
A16.clustering 0.114507 | A14.classification 0.026670 | A16.clustering 0.018511
A1.clustering 0.109055 | A09.clustering 0.023351 | A20.classification 0.015968

Table 13: Precision values for Bag-of-words, n-grams and our approach.
Abstracts | P5 (Bag-of-words, N-grams, Our approach) | R-precision (Bag-of-words, N-grams, Our approach)
A1 | 1.000, 1.000, 1.000 | 0.800, 1.000, 1.000
A2 | 0.800, 1.000, 1.000 | 0.800, 1.000, 1.000
A3 | 0.800, 1.000, 1.000 | 0.800, 0.900, 1.000
A4 | 0.600, 0.400, 1.000 | 0.333, 0.556, 1.000
A5 | 0.800, 0.600, 1.000 | 0.900, 0.800, 1.000
A6 | 1.000, 1.000, 1.000 | 0.667, 0.778, 1.000
A7 | 0.800, 0.800, 1.000 | 0.778, 0.667, 1.000
A8 | 0.800, 0.800, 1.000 | 0.778, 0.556, 1.000
A9 | 0.800, 1.000, 1.000 | 0.900, 0.900, 1.000
A10 | 0.800, 1.000, 1.000 | 0.800, 0.900, 1.000
A11 | 1.000, 1.000, 1.000 | 0.778, 0.889, 1.000
A12 | 0.800, 0.800, 1.000 | 0.778, 0.667, 1.000
A13 | 0.800, 0.800, 1.000 | 0.667, 0.667, 1.000
A14 | 1.000, 1.000, 1.000 | 0.778, 0.667, 1.000
A15 | 0.800, 1.000, 1.000 | 0.800, 1.000, 1.000
A16 | 1.000, 1.000, 1.000 | 0.800, 0.900, 1.000
A17 | 0.600, 1.000, 1.000 | 0.700, 0.900, 1.000
A18 | 0.800, 1.000, 1.000 | 0.600, 0.800, 1.000
A19 | 1.000, 1.000, 1.000 | 1.000, 1.000, 1.000
A20 | 0.800, 0.800, 1.000 | 0.778, 0.667, 1.000
Average | 0.840, 0.900, 1.000 | 0.762, 0.811, 1.000
For that reason, the similarity between two abstracts belonging to these two contexts must be low (low context similarity and low contribution similarity) and, therefore, the risk of plagiarism is very low, or even non-existent. To determine which approach performs the correct matching between the abstracts of our corpus, the precision P5 and the R-precision were computed for each approach and each abstract. An abstract Ab1 is assumed relevant to an abstract Ab2 if Ab1 deals with the same context as Ab2. The precision Px at point x (x = 5 or R) is the ratio of relevant abstracts among the first x returned ones; R in the R-precision is the number of abstracts in the corpus that are relevant to a given abstract. Table 13 summarizes the different values. Our process obtains better results than the Bag-of-words and n-grams approaches: it correctly matches abstracts dealing with the same context and is, therefore, more precise than the other approaches. The Wilcoxon signed-rank test was used to study the statistical significance of the improvement brought about by our process. The p-values between our system and the two other approaches were calculated; the results of the Wilcoxon test are summarized in Table 14. The p-values obtained with the Wilcoxon test are all less than 0.01, which is highly significant. This leads us to conclude that our system matches abstracts by context more correctly than the Bag-of-words and n-grams approaches. Other results are summarized in Table 15.

 | Our approach / Bag-of-words | Our approach / n-grams
P-value at P5 | 0.000213431 | 0.0089409
P-value at R-precision | 0.0000638361 | 0.000219794
Table 14: Wilcoxon test results.

– The content of abstracts A1, A2, A3 and A10 indicates great similarity between the pairs (A1-A3) and (A2-A10). These two pairs of abstracts deal with the same context, use the same algorithms and use ontologies to address what is, a priori, a similar problem. As shown in Table 15, our approach makes it possible to select these abstracts as suspicious, while the Bag-of-words and n-grams approaches select only the pair (A1-A3). A1 and A3 use almost the same words in their content. As for the abstracts A2 and A10, their content is described with different words and different sentences, but both are interested in ontology-based feature selection and use the same clustering algorithm. Our approach is able to capture the meaning of the abstracts and, therefore, retains these two abstracts for a complete examination of their corresponding papers.

Text1 | Text2 | Our approach: context | contribution | application domain | global | Bag-of-Words | N-grams
A1.clustering | A3.clustering | 1.000000 | 0.400673 | 1.000000 | 0.622424 | 0.724688 | 0.352187
A2.clustering | A10.clustering | 0.982456 | 0.486622 | 0.112994 | 0.652692 | 0.198869 | 0.050761
A15.clustering | A16.clustering | 1.000000 | 0.188889 | 0.000000 | 0.469000 | 0.470623 | 0.108580
Table 15: Comparison between Bag-of-words, n-grams and our approach.

– The Bag-of-words approach indicates a match between abstracts A15 and A16. These two abstracts have a high similarity, whereas the authors of the two abstracts use different methods in their contributions. Our approach has the advantage of comparing abstracts at three levels: the contribution similarity between A15 and A16 is very low, which means that the methods used by the authors to address their problems are different. This makes it possible to conclude that even if these two abstracts present similar contexts, the risk of plagiarism is low.
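The evaluation above relies on P@5, R-precision and the Wilcoxon signed-rank test. The following is a minimal sketch of how these quantities can be computed, assuming ranked lists of abstracts per query and per-abstract scores such as the P5 columns of Table 13; the function names are illustrative, and the exact p-values depend on the test variant and on how zero differences are handled, so they need not reproduce Table 14 exactly.

```python
# Minimal sketch of the evaluation metrics used above (illustrative names,
# not the authors' code): P@5, R-precision, and the Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k returned abstracts that share the query's context."""
    top_k = ranked_ids[:k]
    return sum(1 for a in top_k if a in relevant_ids) / k

def r_precision(ranked_ids, relevant_ids):
    """Precision at R, where R is the number of relevant abstracts in the corpus."""
    r = len(relevant_ids)
    return precision_at_k(ranked_ids, relevant_ids, r)

# Per-abstract scores for two systems (here the P5 columns of Table 13 for our
# approach and Bag-of-words); the paired Wilcoxon test checks whether the
# difference over the 20 abstracts is statistically significant.
ours = [1.0] * 20
bag_of_words = [1.0, 0.8, 0.8, 0.6, 0.8, 1.0, 0.8, 0.8, 0.8, 0.8,
                1.0, 0.8, 0.8, 1.0, 0.8, 1.0, 0.6, 0.8, 1.0, 0.8]
stat, p_value = wilcoxon(ours, bag_of_words)
print(f"Wilcoxon statistic={stat}, p-value={p_value:.6f}")
```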
Our approach assesses the similarity of texts in two steps. The documents are first assigned to the domain ontology that best describes their content; this overall similarity is obtained through a semantic classification process. This process emphasizes the overall context of the document, which can be deduced from the terms of the document taken together, unlike conventional classifiers that consider words independently of each other. For documents attached to the same ontology, a "local" similarity is then calculated. This similarity is based on graphs corresponding to the texts. The enrichment of the graphs through the construction of the semantic perimeter of the texts, and the comparison of their graphs, make it possible to deduce similarities that are not explicitly stated in the texts. The similarity calculation for scientific abstracts is refined by dividing their contents into three zones, for which partial similarity values are calculated. This helps to bring out the notions common to both texts. A grouping by context and a ranking in descending order of the global similarity value can be achieved by combining the three partial similarities. The objective of our approach is to find suspicious documents. It has the advantage of comparing the content of the documents at three levels, and the examination of the similarity obtained for each zone makes it possible to decide on the existence of a risk of plagiarism.

Conclusion

The approach proposed in this paper is meant to assess text similarity. This similarity is based on an overall similarity calculation obtained by a classification process. Our classification process is based on domain ontologies and takes into account the relationships between the terms relative to their context of appearance in the document. The evaluation of our process showed better results than those of conventional classifiers. The construction of the semantic perimeter and the comparison of the graphs of texts based on the domain ontology to which they are attached make it possible to enrich the graphs and to deduce implicit information. Our approach thus presents the advantage of taking into account the synonymy and polysemy present in a language and of deducing similarities between two texts that are not explicitly stated in their content. Assessing the similarity between scientific texts represented by their abstracts is our main interest. In the process of semantic comparison, three distinct parts were defined to structure the abstracts of scientific texts: context, contribution and application domain; three partial similarities were then calculated. The comparison of two scientific abstracts is thus performed at three levels. The global similarity value of two abstracts, calculated by combining the partial similarities, makes it possible to rank the abstracts in descending order of their global similarity. A threshold applied to the calculated similarities is useful for finding suspicious documents and highlighting a risk of plagiarism. Tests were performed on a set of scientific abstracts. The enrichment of the graphs makes it possible to bring out common notions not explicitly cited. Moreover, dividing the contents of abstracts into three distinct zones helps in extracting the notions relative to the context, contribution and application domain, and thus makes it possible to compare zones of the same type.
An evaluation can be made to determine whether two abstracts deal with the same context, whether their contributions are similar and whether they apply their approach to the same application domain. The quality of our process depends on domain ontologies that must cover the entire vocabulary of the knowledge domain represented for the process to be effective. This may constitute a limitation of this work since the process used does not support the building of domain ontologies. It is, therefore, assumed that they are available. Even if this can be assumed for scientific texts or abstracts structured as shown in this work, the process obviously needs to be refined for it to be used in comparing general texts. Indeed, one of the ways of improving our approach is to generalize the concept of semantic perimeter so as to consider any text rather than just scientific abstracts. 6 References [1] P. Resnik (1999). Semantic similarity in a taxonomy: An information based measure and its application to problems of ambiguity in natural language. Journal of Aritificial Intelligence Research, Vol.11, Issue 1, pp. 95-130. https://doi.org/10.1613/jair.514 [2] J. Curran (2002). Ensemble methods for automatic thesaurus extraction. In Proceedings of the conference on Empirical methods in natural language processing (EMNLP), Philadelphia, Vol.10, pp. 222-229. [3] P. Cimano, S. Handschuh, and S. Staab (2004). Towards the self-annotating web. In Proceedings of the 13th international conference on World Wide Web, New York, USA, pp. 462-471. [4] Z.S. Harris (1954). Distributional structure. Word, Vol. 10, Issue 2-3, pp. 146–162. https://doi.org/10.1080/00437956.1954.11659520 [5] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, Vol. 41, Issue 6, pp. 391– 407. https://doi.org/10.1002/(SICI)1097­ 4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 [6] E. Gabrilovich and S. Markovitch (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, pp. 1606–1611. [7] M.Yazdani and A. Popescu-Belis (2013). Computing text semantic relatedness using the contents and links of a hypertext encyclopedia: extended abstract. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, pp. 3185–3189. [8] J. Turian, L. Ratinov and Y. Bengio (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 384–394. [9] M. Baroni, G. Dinu and G. Kruszewski (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Volume 1, Baltimore, Maryland, pp. 238–247. [10] C. Leacock, G. A. Miller, and M. Chodorow (1998). Using corpus statistics and WordNet relations for sense identification. Journal of Computational Linguistics, Vol 24, Issue 1, pp. 147-165. [11] R. Rada, H. Mili, E. Bicknell and M. Blettner (1989). Development and application of a metric on semantic nets. IEEE Transactions on systems, Man and Cybernetics, Vol 19, Issue 1, pp.17-30. https://doi.org/10.1109/21.24528 [12] Z. Wu and M. Palmer (1994). Verb semantics and lexical selection. 
In Proceedings of the 32nd Annual Meetings of the Associations for Computational Linguistics, Las Cruces, New Mexico, pp. 133-138. https://doi.org/10.3115/981732.981751 [13] D. C. Howe (2009). RiTa: creativity support for computational literature. In Proceedings of the Informatica 42 (2018) 375–399 397 seventh ACM conference on Creativity and cognition (C&C '09), Berkeley, California, USA, pp. 205-210. [14] D. Lin (1998). An information-theoric definition of similarity. In Proceedings of the 15th international conference on Machine Learning, pp. 296-304. [15] P. Resnik (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Vol 1, Montreal, Quebec, Canada, pp. 448-453. [16] S. P. Ponzetto and M. Strube (2007). Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, Vol 30, Issue 1, pp. 181–212. https://doi.org/10.1613/jair.2308 [17] D. Milne, I. H. Witten (2008). Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, California, USA, pp. 509–518. [18] M. T. Pilehvar and R. Navigli (2015). From senses to texts: An all-in-one graph-based approach for measuring semantic similarity. Journal of Artificial Intelligence Vol. 228, pp. 95–128. https://doi.org/10.1016/j.artint.2015.07.005 [19] G. Salton and M.J. McGill (1983). Introduction to modern information retrieval. McGraw-Hill computer Science Series. [20] G. Salton (1971). The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall. [21] C.J. Crouch, S. Apte, et H. Bapat (2002). Using the extended vector model for xml retrieval. In Proceedings of the First Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), Schloss Dagstuhl, pp. 95-98. [22] E.A. Fox (1983). Extending the Boolean and Vector Space Models of information retrieval with p-norm queries and multiple concept types. PhD thesis, Department of Computer Science, Cornell University. [23] D. Carmel, Y. Maarek, M. Mandelbrod, Y. Mass and A. Soffer (2003). Searching xml documents via xml fragments. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, Toronto, Canada, pp. 151– 158. https://doi.org/10.1002/asi.10060 [24] M. Fuller, E. Mackie, R. Sacks-Davis, and R. Wilkinson (1993). Structural answers for a large structured document collection. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, Pitthsburgh, pp. 204–213. [25] T. Schileder and H. Meus (2002). Querying and ranking XML documents. Journal of the American Society for Information Science and Technology, Vol. 53, Issue 6, pp. 489–503. [26] T. Joachims (1997). A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, Tennessee, pp.143-151. [27] S. Jaillet, A. Laurent and M. Teisseire (2006). Sequential patterns for text categorization. Journal of Intelligent Data Analysis, IOS Press, Vol.10, issue 3, pp.199–214. [28] P. Soucy, G. W. Mineau (2001). A Simple k-NN Algorithm For Text Categorization. In Proceedings of IEEE International Conference on Data Mining, San Jose, USA, pp.647–648. [29] A. Hotho, A. Maedche and S. Staab (2002). Ontology-based Text Document Clustering. KI, Vol. 16, Issue 4, pp. 48-54. [30] S. B. 
Kotsiantis (2007). Supervised Machine Learning: A Review of Classification Techniques. Informatica Vol. 31, Issue 3, pp. 249-268. [31] Y. Yang and X. Liu (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkley, pp. 42–49. [32] T. Joachims (1998). Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, pp. 137–142. [33] E. Gabrilovich and S. Markovitch (2005). Feature Generation for Text categorization Using World Knowledge. In Proceedings of IJCAI 2005: the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, pp. 1048-1053. [34] A. Hotho, S. Staab and G. Stumme (2003). Ontologies Improve Text Document Clustering. In Proceedings of ICDM:3rd IEEE International Conference on Data Minin, Melbourne, FL, USA, pp. 541-544. [35] H. H. Tar and T.T. Soe.Nyunt (2011). Ontology-Based Concept Weighting for Text documents. International Conference on Information Communication and Management IPCSIT vol.16, IACSIT Press, Singapore. [36] B. Pincemin (2000). Similarites texte–texts expérience d’une application de diffusion ciblée et propositions. In Matemáticas y Tratamiento de Corpus, Actes du 2eme séminaire de l’Ecole interlatine de linguistique appliquée,San Millán de la Cogolla, Logroo, Espagne, Logroo : Fundacin San Millán de la Cogolla, 2002, pp 35-52. [37] K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft (1999). When is `nearest neighbor' meaningful. In Proceedings of ICDT, International Conference on Database Theory, pp. 217-235. https://doi.org/10.1007/3-540-49257-7_15 [38] U.L.D.N. Gunasinghe, W.A.M. De Silva, N.H.N.D. de Silva, A.S. Perera, W.A.D. Sashika and W.D.T.P. Premasiri (2014). Sentence similarity measuring by vector space model. In Proceedings of the 14 th International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, pp. 185-189. S. Iltache et al. [39] Y. Liu, C. Sun, L. Lin, Y. Zhao and X. Wang (2015). Computing Semantic Text Similarity Using Rich Features. In Proceedings of PACLIC: 29th Pacific Asia Conference on Language, Information and Computation, Shanghai, China, pp. 44 – 52. [40] J. Lewis, S. Ossowski, J. Hicks, M. Errami and H. R. Garner (2006). Text similarity: an alternative way to search MEDLINE. Bioinformatics Vol. 22, Issue 18, pp. 2298–2304. https://doi.org/10.1093/bioinformatics/btl388 [41] E. Yamamoto, M. Kishida, Y. Takenami, Y. Takeda and K. Umemura (2003). Dynamic programming matching for large scale information retrieval. In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages, Vol.11, Sapporo, Japan, pp. 100–108. https://doi.org/10.3115/1118935.1118948 [42] W. Ma and T. Suel (2016). Structural Sentence Similarity Estimation for Short Texts. In Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, Florida, pp. 232–237. [43] D. Dudognon, G. Hubert and B. Ralalason (2010). Proxigénéa : Une mesure de similarite conceptuelle. In Proceedings of the Colloque Veille Strategique Scientifique et Technologique (VSST 2010). [44] M. Baziz, M. Boughanem, H. Prade and G. Pasi (2005). A Fuzzy Set Approach to Concept-based Information Retrieval. 
In Proceedings of the 4th Conference of the European Society for Fuzzy Logic and Technology and the 11eme Eleventh Rencontres Francophones sur la Logique Floue et ses Applications (Eusflat-LFA 2005 joint Conference), Barcelona, Spain, pp. 1287–1292. [45] K. M. Shenoy, K.C. Shet, U.D. Acharya (2012). Semantic plagiarism detection system using ontology mapping. Advanced Computing: An International Journal (ACIJ), Vol.3, Issue 3, pp. 59– 62. [46] L. Zhang, C. Li, J. Liu and H. Wang (2011). Graph-Based Text Similarity Measurement by Exploiting Wikipedia as Background Knowledge. International Journal of Computer, Electrical, Automation, Control and Information Engineering Vol.5, Issue 11, pp. 1328–1333. [47] W. Jin and R. K. Srihari (2007). Graph-based Text Representation and Knowledge Discovery. In Proceedings of the 2007 ACM symposium on Applied computing, Seoul, Korea, pp. 807-811. https://doi.org/10.1145/1244002.1244182 [48] P. Wang, H. Zhang, B. Xu, C. Liu, and H. Hao (2014). Short Text Feature Enrichment Using Link Analysis on Topic-Keyword Graph. In Proceedings of Natural Language Processing and Chinese Computing, Springer, pp. 79–90. [49] J. Leskovec and J. Shawe-Taylor (2005). Semantic text features from small world graphs. Workshop on Subspace, Latent Structure and Feature Selection techniques: Statistical and Optimization perspectives, Bohinj. [50] S. Brin, J. Davis and H. Garcia-Molina (1995). Copy detection mechanisms for digital documents. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, pp. 398–409. https://doi.org/10.1145/223784.223855 [51] C. Basile, D. Benedetto, E. Caglioti, and M. D. Esposti (2008). An example of mathematical authorship attribution. Journal of Mathematical Physics, Vol. 49, Issue 12, pp. 125211-1–125211­20. https://doi.org/10.1063/1.2996507 [52] C. Basile, D. Benedetto, E. Caglioti, G. Cristadoro and M. D. Esposti (2009). A plagiarism detection procedure in three steps: selection, matches and squares. 3rd Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, PAN 2009. [53] B. Stein, S.M. zu Eissen (2005). Near Similarity Search and Plagiarism Analysis. In Proceeding of the 29th Annual Conference of the GfKl Springer, pp. 430-437. [54] R. Lukashenko, V. Graudina and J. Grundspenkis (2007). Computer-Based Plagiarism Detection Methods and Tools: An Overview. In Proceeding of the 2007 International Conference on Computer Systems and Technologies -CompSysTech’07, Bulgaria, article N° 40. https://doi.org/10.1145/1330598.1330642 [55] K. Vani, D. Gupta (2015). Investigating the Impact of Combined Similarity Metrics and POS tagging in Extrinsic Text Plagiarism Detection System. In Proceeding of the International Conference on Advances in Computing, Communications and Informatics (ICACCI), Kochi, India, pp. 1578­1584. [56] A. H. Osman, N. Salim, M. S. Binwahlan, H. Hentably and A. M. Ali (2011). Conceptual similarity and graph-based method for plagiarism detection. Journal of Theoretical and Applied Information Technology, Vol. 32, Issue 2, pp. 135-145. [57] D. Rusu, B. Fortuna, M. Grobelnik and D. Mladenić (2009). Semantic Graphs Derived from Triplets with Application in Document Summarization. Informatica Vol.33, Issue 3, pp. 357–362. [58] S. Iltache, C. Comparot, M. Si Mohammed and P. J. Charrel (2016). Using domain ontologies for classification and semantic interpretation of documents. 
In Proceedings of ALLDATA 2016: 2nd International Conference on Big Data, Small Data, Linked Data and Open Data, pp. 76-81. [59] R. Bendaoud, (2009). Analyses formelle et relationnelle de concepts pour la construction d’ontologies de domaines a partir de ressources textuelles hétérogenes. PhD thesis, Henri Poincaré University, Nancy 1. [60] N. Fuhr and K. Grossjohann (2001). XIRQL: a query language for information retrieval in XML documents. In Proceedings of the 24th annual international ACM SIGIR conference on Research Informatica 42 (2018) 375–399 399 and development in information retrieval, New Orleans, Louisiana, USA, pp. 172-180. [61] E. Omodei, Y. Guo, J. P. Cointet and T. Poibeau, (2014). Analyse discursive automatique du corpus ACL Anthology. In : Actes de la 21eme conférence Traitement Automatique des Langues Naturelles, Marseille. [62] Y. Guo, A. Korhonen and T. Poibeau (2011). A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents. In Proceedings of the 2011 conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, pp. 273–283. [63] B. Magnini and G. Cavaglia (2000). Integrating Subject Field Codes into WordNet. In Proceedings of LREC-2000, Second International Conference on Language Resources and Evaluation, Athens, Greece, pp. 1413-1418. [64] C. Fellbaum (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge MA. [65] K. Toutanova, D. Klein, C. Manning, and Y. Singer (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL, pp. 252-259. [66] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Vol. 11, Issue 1. pp. 10-18. https://doi.org/10.1145/1656274.1656278 Improved Local Search BasedApproximation Algorithmfor Hard Uniform Capacitated k-Median Problem Sapna Grover, Neelima Gupta and AdityaPancholi Department of Computer Science, University of Delhi, India E-mail: sgrover@cs.du.ac.in, ngupta@cs.du.ac.in, apancholi@cs.du.ac.in Keywords:NP completeness, approximation algorithm, k-median problem Received: January 7, 2017 In this paper, we study the hard uniform capacitated k -median problem.Wegive (5 + ) factor approxi­mationforthe problemusing local search technique, violating cardinalitybyafactorof 3. Though better results are known for the problem using LP techniques, local search algorithms are well known to be sim­pler. Thereisa trade-offviz-a-viz approximationfactor andcardinality violation between our result and the result ofKorupolu et al. [10] which is the only result known for the problem using local search. They gave(1 + .) approximationfactor with (5 + 5/.) factor loss in cardinality. In a sense, our result is an improvementastheyviolatethe cardinalityby morethanafactorof 6 to achieve 5 factor in approximation. Thoughintheirresult,the approximationfactorcanbemade arbitrarilysmall, cardinalitylossisatleast 5 and small approximationfactoris obtainedatabiglossin cardinality. Thus,weimproveupon their result with respect to cardinality. Povzetek: ObravnavanjeNP problem optimiranja iskanjakmedianin predlagana izvirna rešitev,ki dosega boljše rezultate v doloˇ cenih primerjavah. 1 Introduction k -Median Problem is one of the well studied NP-hard optimization problem. The input instance consists of a set of clients,a setoffacilities,a non-negative number k and a non-negative costof connectingafacilitytoa client. 
The goal is to select a set of at most k facilities as centers and assign clients to them such that the total cost of serving the clients from centers is minimum. Several versions of the problem exist in literature with different properties, the most common being Un­capacitated k Median Problem (UkM) and Capacitated k Median Problem (CkM). In the former case, eachfacility has infnite capacity(i.e. there is no limit on the amount of demand it can serve) in comparison to fnite capacity in the latter case. In CkM, capacities may be soft or hard. In soft capacitated version, multiple copies of a facility can be opened at a location whereas in case of hard capacities, eachfacilityis either openedat some locationor not. Also, the capacities may be uniform or non-uniform. In the for­mer case, allfacilitieshave thesame capacityin contrast to the latter one where-in differentfacilities have different capacities. Another variation of CkM is with respect to assignments of clients tofacilities: in un-splittable assign­ments, the entire demand of a client has to be served by only onefacility, in comparison to splittable assignments in which the demand of a client can be split among multi-plefacilities. Several techniques have been used to obtain results for the problem. One of the most widely used technique to ap­proximate the problem is LP Rounding ([4, 5, 7, 8, 9, 11, 12, 13, 14]). Charikar et al. [7]gave a 20/3 factor approxi­mation algorithm for UkM, which was further improved to 3.25 factorby CharikarandLiin[8].LiandSvensson[14] . further improved the ratio to 1+ 3+ . Their algorithm (1/2 hasa running timeofO(n)). Obtaining a constant approximation factor for CkM problem without violating capacity constraint and cardinal­ity constraint is challenging as natural LP of the problem is known to have an unbounded integralitygap. Approxima­tion results violate either capacity constraint or cardinality constraint, or both. Cardinality violation: Li [12]gave a novel linear pro­gram called rectangle LP and presented an improved ap­proximation algorithm(exp(O(1/2)))using at most(1 + )k facilities for hard uniform CkM problem. The running time of the algorithm is nO(1), where the constant in the exponent does not depend on . He then extended this re­sult to non-uniform soft capacitated variant of the problem in [13] andgave an (O(1/2 log(1/))) approximationfac-tor bounding softnessbyafactorof 2. The algorithm has a O(1/) running time of n. Capacity violation: Charikar et al. [7]gavea 16 factor approximation algorithm for hard uniform CkM violating capacities by a factor of 3 in case of splittable demands and 4 in case of un-splittable demands. In 2015, Byrka et al. [4]gave an O(1/) approximation algorithm violat­ing capacitiesbyafactorof (3 + ) for hard non-uniform CkM. Demirci et al. [9] improved the approximation ratio to O(1/5) with capacity violation of (1 + ) for the same version of the problem. The running timeof their algorithm is nO(1/). Recently, Byrka et al. [5]gave an O(1/2) ap­proximation violating capacitiesbyafactorof (1 + ) for hard uniform CkM. The algorithm uses randomized round­ing to round a fractional solution to the confguration LP. Aardal et al. [1] exploited the structure of an extreme point solutiontogivea(7+)factor algorithm for hard non­uniform Capacitated k-Facility Location Problem (Ck-FLP) violating cardinality constraintbyafactorof2. As a special case of CkFLP, their result applies on hard non­uniform CkM with allfacility costs being zero. 
In the same manner,the CkFLP result(1/2)of Byrkaet al. [4] is appli­cable on hard uniform CkM. The result violates capacities by afactor of 2+ . The other commonly used technique for the problem is local search[2,6,10]. Charikarand Guha[6]gave4factor algorithm without violating cardinality constraint for the un-capacitatedvariantof the problem.Korupolu et al. [10] gaveO(1+ )factor approximation algorithm for UkM us­ing at most 3+5/ facilities. Aryaet al.[2]gave an impro­vised result of 3+2/p factor algorithm for UkM by using p-swaps. We presenta(5+ )factor algorithm for hard uniform CkM violating the cardinality by a factor of 3 using Lo­cal Search. Algorithms based on local search are well known to be simpler as compared to the LP-based algo­rithms. The only result known for the problem using local search is due toKorupolu et al. [10]. They give an algo­rithm with a trade-off between approximation factor and cardinality loss. They give (1 + .) approximationfactor with (5 + 5/.) factor lossin cardinality.To achieve5 fac­tor in approximation, cardinality violation is more than 6. Though the approximation factor can be made arbitrarily small, cardinality loss is at least 5. Note that small approx-imationfactoris obtainedatabig lossin cardinality. For example, for . anything less than 1, cardinality violation is more than 10. Though we somewhat loose on the ap­proximationfactor, we surely improve upon the cardinality violation. Thus, there is a trade-offbetween cardinality vi­olation and approximationfactor amongst their result and ours. In particular, we present the following result: Theorem 1. There is a polynomial time algorithm that approximates harduniform capacitated k median problem within 5 factor violating the cardinality by a factor of 3. High Level Idea: We extend the idea of ‘mapping’ of Arya et al. [2] to the capacitated version of the problem. However, for the capacitated case, mapping needs to be done a little intelligently. Mapping to an almost fully uti­lizedfacilitymaynotbeableto accommodateallthe clients mapped to it and vice-versa. That is, a partially utilized facility may not be able to accommodate the load of an almost fully utilizedfacility. Thus, mapping is done only between the partially utilizedfacilities.Toensure that there aresuffcient numberof partially utilizedfacilities,weneed to assume that we have suffcient number(3k)of opened centers. 2 Notation and preliminaries 2.1 Capacitated k-median problem In Capacitated k-Median Problem, we are given a set of F of facilities, a set C of clients and a real valued distance function c on F.C in metric space. Each client j .C has a non-negative demand dj and eachfacility i .F has a capacity ui indicating the amount of demand it can serve. The cost of serving one unit of demand of a client j .C fromfacility i .F is denoted as c(i, j). The goal is to select a subset S .F of at most k facilities and assign clients to them without violating the capacities such that the total costof servingallthe clientsbythe openedfacilities is minimum. We consider the hard uniform capacitatedk-medianver­sion of the problem i.e. ui = U .i .F and at most one instanceofafacility canbe openedatits location.We as­sume unit demand at each client i.e. dj =1 .j .C. 2.2 Local search paradigm Given a Problem P, let S be anyarbitrary feasible solution to it. A new solution S0 is called a neighborhood solution of S if it can be obtained by performing local search operations such as adding one or morefacilities s . 
∉ S to S, deleting one or more facilities s from S, or swapping one or more facilities of S with facilities not in S. We now formally describe the steps of the algorithm.

The paradigm:
1. Compute an arbitrary feasible solution S to P.
2. While there is a neighborhood solution S′ of S such that cost(S′) < cost(S), set S = S′.

The solution S so obtained is called a locally optimal solution. Note that cost(S′) ≥ cost(S) for every neighborhood solution S′, for otherwise S would not have been locally optimal. More formally, a solution S is said to be locally optimal if no further operation results in an improvement in cost.

3 (5 + ε, 3) algorithm

For the k-median problems, we define an (a, b)-approximation algorithm as a polynomial-time algorithm that computes a solution using at most bk facilities with cost at most a times the cost of an optimal solution using at most k facilities. We select an arbitrary set of facilities S ⊆ F such that |S| = 3k. This set acts as our initial feasible solution. Note that defining a subset of opened facilities completely specifies a solution: we can obtain the assignments by solving an appropriately defined instance of the transportation problem. The only operation permitted by our algorithm is swap(s, o), defined as follows: S = S − {s} + {o}, o ∈ F\S, s ∈ S. Reassign all the clients served by o in the optimal solution to o in our new solution. We run the local search algorithm on S. Since S is now locally optimal, for all neighborhood solutions S′ of S we have cost(S′) ≥ cost(S).
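As a rough illustration of the paradigm and the swap(s, o) operation above, the sketch below shows the control flow of such a local search under stated assumptions; it is not the authors' algorithm. In particular, the routine cost is left to the caller (it should assign clients to the open facilities, e.g. by solving the transportation problem), and min_improvement is an assumed tolerance used to stop once no swap gives a meaningful improvement.

```python
# Rough sketch of swap-based local search (illustrative, not the authors'
# implementation). `facilities` is a set of facility identifiers; `cost(S)` is a
# user-supplied routine that evaluates a set of open centers by assigning the
# clients to them; `min_improvement` is an assumed stopping tolerance.
import itertools
import random

def local_search(facilities, k, cost, min_improvement=1e-6):
    """Return a locally optimal set of 3k centers under single swap(s, o) moves."""
    S = set(random.sample(sorted(facilities), 3 * k))      # arbitrary feasible solution
    improved = True
    while improved:
        improved = False
        current = cost(S)
        for s, o in itertools.product(sorted(S), sorted(facilities - S)):
            candidate = (S - {s}) | {o}                     # swap(s, o)
            if cost(candidate) < current - min_improvement:
                S, improved = candidate, True
                break                                       # restart from the new solution
    return S
```

With a cost routine that solves the assignment exactly, the loop mirrors the paradigm of Section 2.2; the analysis that follows bounds the cost of the locally optimal solution it returns.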
3.1 Analysis

Let O denote the optimal solution to the problem. We now show that the locally optimal solution S is within a factor 5 of the optimal solution, i.e., cost(S) ≤ 5·cost(O). For a client j, let σS(j) and σO(j) denote the facilities serving j in S and O respectively. Also, let Sj and Oj denote the service costs paid by j in S and O respectively. Let s ∈ S and o ∈ O. Consider Figure 1. Let BS(s) denote the ball of s, that is, the set of clients served by s in S. Similarly, let BO(o) denote the ball of o ∈ O. Also, let B(s, o) be the set of clients served by s ∈ S and by o ∈ O.

Figure 1: Balls of facilities.

To deal with capacities, we classify the facilities in S based on the number of clients served by them. A facility s ∈ S is said to be heavy if it serves more than U/2 clients in S; otherwise it is said to be light. Note that the number of heavy facilities can be at most 2k. Let SL denote the set of light facilities in S. Since |S| = 3k, we have |SL| ≥ k. Let BLO(o) be the set of clients served by o in the optimal solution O and by light facilities in S, and let Mo = |BLO(o)|. We say that a facility s ∈ SL dominates o if it serves more than half the clients served by light facilities in S and by o ∈ O, i.e., |B(s, o)| > Mo/2. A facility belonging to SL is called bad if it dominates more than one facility in O, good if it dominates exactly one facility in O, and nice otherwise.

We now devise a 1-1 and onto mapping π : BLO(o) → BLO(o). Order the clients in BLO(o) as j0, j1, ..., jMo−1 such that for every s ∈ S with a non-empty B(s, o), the clients in B(s, o) are consecutive; that is, there exist r, t, 0 ≤ r ≤ t ≤ Mo − 1, such that B(s, o) = {jr, ..., jt}. Define π(jp) = jq, where q = (p + ⌈Mo/2⌉) mod Mo. Consider Figure 2a, which shows the set BO(o); the corresponding mapping is shown in Figure 2b. The following claim holds for the mapping:

Claim 1. If s ∈ SL does not dominate o, then π(B(s, o)) ∩ B(s, o) = ∅.

Proof. For contradiction, assume that both jp and π(jp) = jq belong to B(s, o) for some s with |B(s, o)| ≤ Mo/2. If q = p + ⌈Mo/2⌉, then |B(s, o)| ≥ q − p + 1 = ⌈Mo/2⌉ + 1 > Mo/2. If q = p + ⌈Mo/2⌉ − Mo, then |B(s, o)| ≥ p − q + 1 = Mo − ⌈Mo/2⌉ + 1 > Mo/2. In either case we have a contradiction, and hence the mapping π satisfies the claim.

Figure 2: Mapping.

The notion of dominance can be used to construct a bipartite graph H = (S, O, E). For each facility in SL we have a vertex on the S-side, and for each facility in O we have a vertex on the O-side. We add an edge between s ∈ SL and o ∈ O if s dominates o. Note that the degree of each vertex on the O-side is at most one, while the vertices on the S-side can have degree up to k.

We now consider all k swaps, one for each facility in O. If s ∈ SL is good, then we consider the swap(s, o), where o is the facility in O dominated by s. Let ℓ be the number of facilities in O that did not participate in the above swaps. Then the total number of bad and nice facilities in SL is at least ℓ, and at least ℓ/2 of them must be nice. The remaining ℓ facilities in O get swapped with the nice facilities in SL such that each nice facility is considered in at most two swaps. The bad facilities are not considered for swapping. The swaps considered above satisfy the following properties:
1. Each o ∈ O is considered in exactly one swap.
2. Facilities in S\SL are not considered in any swap operation.
3. Bad facilities in SL are not considered in any swap operation.
4. Each nice facility s ∈ SL is considered in at most two swap operations.
5. If swap(s, o) is considered, then s does not dominate any facility o′ ∈ O with o′ ≠ o.

Lemma 1. Let cost(S) denote the cost of the locally optimal solution S, and let cost(O) denote the cost of the global optimal solution O. Then cost(S) ≤ 5·cost(O).

Proof. Consider swap(s, o). Let j ∈ BS(s). We first reassign the clients in BS(s):
1. If j ∈ BO(o), assign j to o.
2. If j ∉ BO(o), assign j to s′ ∈ SL such that π(j) = j′ and j′ ∈ BS(s′).

In case 1, the change in cost is given by (Oj − Sj). In case 2, the change in cost is (c(j, s′) − Sj). Let j ∈ BO(o′). From the triangle inequality, we get c(j, s′) ≤ c(j, o′) + c(o′, π(j)) + c(π(j), s′) = Oj + Oπ(j) + Sπ(j). As S is a locally optimal solution, we have

Σ_{j ∈ BS(s) ∩ BO(o)} (Oj − Sj) + Σ_{j ∈ BS(s) \ BO(o)} (Oj + Oπ(j) + Sπ(j) − Sj) ≥ 0    (1)

Each facility o ∈ O is considered in exactly one swap operation. Thus the first term of the inequality, added over all k swaps, gives exactly cost(O) − cost(S). Each s ∈ S is considered in at most two swaps, so the second term of the inequality, added over all k swaps, is no greater than 2·Σ_j (Oj + Oπ(j) + Sπ(j) − Sj). As π is a 1-1 and onto mapping, Σ_{j∈C} Oj = Σ_{j∈C} Oπ(j) and Σ_{j∈C} (Sπ(j) − Sj) = 0. Thus, 2·Σ_j (Oj + Oπ(j) + Sπ(j) − Sj) = 4·cost(O). Combining the two terms, we get cost(O) − cost(S) + 4·cost(O) ≥ 0, and thus cost(S) ≤ 5·cost(O).

In the algorithm presented so far, we move to a new solution if it gives some improvement in the cost, however small that improvement may be. This may lead to the algorithm taking a lot of time. To ensure that the algorithm terminates in polynomial time, a local search step is performed only when the cost of the current solution S is reduced by at least ε·cost(S)/p(n, ε), where n is the size of the problem instance and p(n, ε) is an appropriate polynomial in n and 1/ε for a fixed ε > 0. This modification of the algorithm incurs a cost of an additive ε in the approximation factor.

It is easy to see that if we have 3.5k facilities, then the total number of bad and nice facilities in SL is at least ℓ + k/2, and at least (ℓ + k)/2 ≥ ℓ of them must be nice. The remaining ℓ facilities in O get swapped with the nice facilities in SL such that each nice facility is considered in at most one swap.
This saves usfactor 2 coming from the second term of equation (1). Thus, we get(3+ , 3.5) algorithm. Also, using p-swaps of Arya et al. [2], we can get(3+2/p, 3)algorithm. 4 Conclusion and future work Wegave a(5 + ) factor approximation algorithm for hard uniform capacitated k median problem using local search technique, violating cardinalitybyafactorof 3. It improves upon theexisting results known for the problem using local search, with respect to cardinality violation. It would be interesting to obtain a constant factor algorithm reducing the cardinality violation to(1+ ). Though such a result is known using LP-techniques, it would be interesting to obtain similar result using local search. Another direction to extend the work would be to consider the non-uniform capacitated version of the problem using local search. References [1] Karen Aardal, Pieter L. van den Berg, Dion Gijswijt, and Shanfei Li. Approximation algorithms for hard capacitated k-facility location problems. European Journal of Operational Research, 242(2):358–368, 2015. https://doi.org/10.1016/j.ejor.2014.10.011 [2] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pan-dit. Local search heuristic for k-median and fa­cility location problems. In Proceedings on 33rd Annual ACM Symposium on Theory of Comput­ing, Heraklion, Crete, Greece, pages 21–29, 2001. https://doi.org/10.1145/380752.380755 [3] Manisha Bansal. Approximation algorithms forfacil­ity location problems-https://drive.google.com/fle/d/ 0bxmghjb2ede3nkvka2rzcfnethm/view. In PhD The­sis, October, 2013. [4] Jaroslaw Byrka, Krzysztof Fleszar, Bartosz Rybicki, and Joachim Spoerhase. Bi-factor approximation al­gorithms for hard capacitated k-median problems. In Proceedings of theTwenty-Sixth AnnualACM-SIAM Symposium on Discrete Algorithms, SODA2015, San Diego, CA, USA,January 4-6, 2015, pages 722–736, 2015. https://doi.org/10.1137/1.9781611973730.49 [5] Jaroslaw Byrka, Bartosz Rybicki, and Sumedha Uniyal. An approximation algorithm for uniform ca-pacitated k-median problem with(1+ )capacity vi­olation. In Integer Programming and Combinatorial Optimization -18th International Conference, IPCO 2016, Liege, Belgium, June 1-3, 2016, Proceedings, pages 262–274, 2016. https://doi.org/10.1007/978-3­319-33461-5_22 [6] Moses Charikar and Sudipto Guha. Improved com­binatorial algorithms for thefacility location and k-median problems. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Sci­ence (FOCS), New York, NY, USA, pages 378–388, 1999. https://doi.org/10.1137/s0097539701398594 [7] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended ab­stract). In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, May 1­4, 1999, Atlanta, Georgia, USA, pages 1–10, 1999. https://doi.org/10.1145/301250.301257 [8] Moses Charikar and Shi Li.Adependent lp-rounding approach for the k-median problem. In Automata, Languages, and Programming -39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part I, pages 194–205, 2012. https://doi.org/10.1007/978-3-642-31594-7_17 [9] H. Galp Demirci and Shi Li. Constant approxima­tion for capacitated k-median with (1 + )-capacity violation. In Proceedings of the 43rd International Colloquium onAutomata, Languages, and Program­ming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 73:1–73:14, 2016. [10] Madhukar R. Korupolu, C. 
Greg Plaxton, and Rajmohan Rajaraman. Analysis of a local search heuristic for facility location problems. Journal of Algorithms, 37(1):146–188, 2000. https://doi.org/10.1006/jagm.2000.1100 [11] Shanfei Li. An Improved Approximation Algorithm for the Hard Uniform Capacitated k-median Problem. In Approximation, Randomization, and Combinato­rial Optimization. Algorithms and Techniques (AP­PROX/RANDOM 2014), Germany, pages 325–338. https://doi.org/10.1016/j.ejor.2014.10.011 [12] Shi Li. On uniform capacitated k-median be­yond the natural LP relaxation. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA,January 4-6, 2015, pages 696–707, 2015. https://doi.org/10.1137/1.9781611973730.47 [13] Shi Li. Approximating capacitated k-median with (1 + )k open facilities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 786–796, 2016. https://doi.org/10.1137/1.9781611974331.ch56 [14] Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. In Proceedings of Symposium on Theory of Computing Conference (STOC),Palo Alto, CA, USA, pages 901–910, 2013. https://doi.org/10.1145/2488608.2488723 EffcientTrajectory Data PrivacyProtection Scheme Based on Laplace’s Differential Privacy Ke Gu†,‡,±, LihaoYang†,Yongzhi Liu†, BoYin† † Schoolof Computer&Communication Engineering Changsha Universityof Science&Technology, Changsha 410114, China ‡ HunanProvincialKeyLaboratoryof Intelligent ProcessingofBigDataonTransportation Changsha Universityof Science&Technology, Changsha 410114, China ± School of Information Science and Engineering, Central South University, Changsha 410083, China Keywords:trajectory data, polygon, centroid, Laplace’s differential privacy Received: May 25, 2017 Now manyapplicationsof trajectory (location) datahavefacilitated people’s daily life.However, publish­ing trajectory data may divulge individual sensitive information so as to infuence people’s normal life. On theotherhand,ifwe cannotmineandshare trajectorydata information, trajectorydatawillloseitsvalueto serve our society. Currently, because the records of trajectory data are discrete in database, some existing privacyprotection schemesarediffculttoprotect trajectorydata.Inthispaper,weproposeatrajectorydata privacyprotection scheme based on Laplace’s differential privacymechanism. In the proposed scheme, the algorithm frst selects the protected points from the user’s trajectory data; secondly, the algorithmbuilds the polygons according to the protected points and the adjacent and high frequent accessed points selected from the accessed point database, then the algorithm calculates the polygon centroids; fnally, the noises are added to the polygon centroids by the Laplace’s differential privacy method, and the new polygon centroids are used to replace the protected points, and then the algorithm constructs and issues the new trajectory data. Theexperiments show that the running timeof the proposed algorithmsisfast, theprivacy protection of the scheme is effective and the data usability of the scheme is higher. Povzetek: Predlaganajemetoda zauˇcne cinkovitovarovanje podatkovo poteh na osnovi Laplacove diferenˇprivatnosti. 1 Introduction 1.1 Background With the rapid development of computer and network, data mining and analysis plays an increasingly important role in our social life. 
The huge amounts of data (such as big data) can bring many application services to our society, such as trajectory(location) data, health and food data, traf­fc safety data, etc. Trajectory data is a kind of position information with large scale, fast changing and generally accepted characteristics, which mainly comes from vehi­cle networks, mobile devices, social networks and so on. Now many applications of trajectory data havefacilitated people’s daily life, thus trajectory data service is called as a kind of new mobile computing service. Currently, it is the key of developing trajectory data services that we must be able to learn and understand position information [1]. However, trajectory data is mainly collected and dis-seminatedby mobile equipments,but manymobiledevices and mobile communication technologies must integrate ge­ographical data and individual information into trajectory data, such as individual information may contain individ­ual privacy data, personal health status, social status and behavior habits, etc, thus mining and publishing trajectory data may divulge individual sensitive information so as to infuence people’s normal life [2,3,4]. Now it is the key of trajectory data privacy protection that howto protect sensitivetrajectory data while providing trajectory information service on data mining. For exam­ple, if mined data is not processed and protected on fully open status, mined data may reveal user’s privacy so as to affect user’s normal life. Thus, it is double-edged sword that how to mine and use trajectory data. Namely we must fnda compromising approach between service and protec­tion. However, many existing privacyprotection schemes cannot provide the balance of utility and protection. For example, the generalization method [5] cannot availably protect data, and the anonymous grouping method [6] is not effcient enough. Furthermore, because the records of trajectory data are discrete in database1, some existing pri­vacy protection schemes are diffcult to protect trajectory data. Therefore, we focus on fnding an effcient privacy protection scheme for trajectory data in this paper. 1In real world, trajectory data may not be discrete. In this paper, our focus is the combination of location data and accessed frequency, thus we consider that the records of trajectory data are discrete. 1.2 Our contributions In this paper, we propose a trajectory data privacyprotec­tion scheme based on Laplace’s differential privacymech­anism. In the proposed scheme, the algorithm frst selects the protected points from the user’s trajectory data; sec­ondly, the algorithmbuilds the polygons according to the protected points and the adjacent and high frequent ac­cessed points selected from the accessed point database, then the algorithm calculates the polygon centroids; f­nally, the noises are added to the polygon centroids by the Laplace’ differential privacy method, and the new poly­gon centroids are used to replace the protected points, and then the algorithm constructs and issues the new trajectory data. The experiments show that the running time of the proposed algorithms is fast, the privacy protection of the scheme is effective and the data usability of the scheme is higher. 1.3 Outline The rest of this paper is organized as follows. In Section 2, we discuss the related works about trajectory data privacy protection. In Section 3, we review the related defnitions and theorems on which we employ. 
In Section 4, we pro­pose an effcient trajectory data privacyprotection scheme, which is based on the Laplace’s differential privacymech­anism. In Section 5, we analyze and show the effciencyof the proposed scheme by the experiments. Finally, we draw our conclusions in Section 6. 2 Related work Currently many privacy protection schemes are being widely used in many felds, such as secure communica­tion, social network, data mining and so on. The works [5,6] frst proposed the k-anonymity model to protect so­cial network, whose anonymity protection methods mainly include generalization [7,8], compression, decomposition [9], replacement [10] and interference. Based on theworks of [5,6], manyother k-anonymous protection methods [11­ 21] were also proposed. However, the works [20,21,22] proved that some anonymous protection methods cannot protect sensitive data very well. Additionally, Cristofaro et al. [23] proposed a privacy-encrypted protection scheme. Although their scheme can ensure data security, data util­ity is decreased. Current location data privacy protection methods [1,24] are mainly classifed to three categories: the heuristic privacy-measure methods, the probability-based privacyinference methods and the privacyinformation re­trieval’s methods. The heuristic privacy-measure meth­ods [25,26,27,28] are mainly to provide the privacy pro­tection measure for some no-high required users, such as k-anonymity [25], t-closing [26], m-invariability [27] and l-diversity [28]. Also, although the information retrieval’s privacy protection methods can achieve perfect privacy protection, there are more or less privacy information in the released data, so these methods may result in that no data can be released, and these methods have high over­head. Additionally, the probability-based privacyinference methods can protect data and achieve better data utility un­der certain conditions, but the effectiveness of the meth­ods depends on original data availability. Further, the three kinds of methods are based on a unifed attack model [1], which depends on certain background knowledge to protect location data. However, with the increase of background knowledgegotbytheattackers,these methodscouldnotal-ways effectively protect location data. The works [5,6,11­ 19] showed the shortages of the relationship-privacy pro­tection methods. Ting et al. [29] analyzed a variety of privacy threat models and tried to optimize the effective­ness of the data obtained while preventing different types of reasoning attacks. Bugra et al. [30] proposed the frst effective location-privacy preserving mechanism (LPPM) that enablesa designerto fndthe optimal LPPMforaLBS (location-based service) given user’s service quality con­straints against an adversary implementing the optimal in­ference algorithm. Such LPPM is the one that maximizes the expected distortion (error) that the optimal adversary incurs in reconstructing the actual location of a user, while fulflling the user’s service-quality requirement. Presently, itisthekeyof protecting location datatoprovideaprivacy protection method not sensitively to background knowl­edge. Based on the requirement, differential privacy pro­tection technology can exactly satisfy it. Differential pri­vacyis a kind of strong privacyprotection method, which is not sensitive to background knowledge. However, be­cause location data has the characteristics of sparsity and farrago, many differential privacy protection methods are not enough effcient. He et al. 
[31] proposed a synthetic system based on GPS paths, which can provide a strong differential privacy protection mechanism. The proposed system obtains different speed trajectories by using a hierarchical reference method to isolate the original trajectory, and then protects the speed trajectories. Chatzikokolakis et al. [32] proposed a predictive differentially-private mechanism for location privacy, which can offer substantial improvements over independently applied noise. Their work showed that correlations in the trace can in fact be exploited in terms of a prediction function that tries to guess the new location based on the previously reported locations. Additionally, their work tested the quality of the predicted location using a private test; in case of success the prediction is reported, otherwise the location is sanitized with new noise. Chatzikokolakis et al. [33] also presented a formal notion of privacy that protects the user's exact location, "geo-indistinguishability", and then proposed two mechanisms to protect the privacy of the user when dealing with location-based services. They also extended their mechanisms to the case of location traces, and provided a method to limit the degradation of the privacy guarantees due to the correlation between the points. Li et al. [34] proposed a compressive mechanism for differential privacy, which is based on compressed sensing theory. Their mechanism considers every data item as a single individual, so it undermines the relationships within the data and is therefore not suitable for protecting location data. Jia et al. [1] proposed a differential privacy-based transaction data publishing scheme. Their method establishes the relationship of transaction data items by a query tree and adds noise to the query tree based on the compressive mechanism and the Laplace mechanism. However, it is difficult to measure the effectiveness of their method for privacy protection. Zhang et al. [35] proposed an accurate method for mining top-k frequent data records under differential privacy. In their scheme, the exponential mechanism is used to sample the top-k frequent data records, and the Laplace mechanism is then utilized to generate noise to distort the original data. Although the effectiveness of their method may be accurately measured for privacy protection, their method neglects the relationship of transaction data items.

3 Differential privacy

Differential privacy protection achieves its privacy protection target by distorting the data, where the common approach is to add noise to queried results. The purpose of differential privacy protection is to minimize privacy leakage and to maximize data utility [36,37]. Currently, differential privacy protection has two main methods [38,39]: the Laplace mechanism and the exponential mechanism.

Dwork et al. [39] proposed a protection method for the sensitivity of private data, which is based on the Laplace mechanism. Their method distorts the sensitive data by adding Laplace-distributed noise to the original data. Their method may be described as follows: the algorithm M is the privacy protection algorithm based on the Laplace mechanism, the set S is the noisy output set of the algorithm M, and the input parameters are the data set D, the function Q, the function sensitivity ΔQ and the privacy parameter ε, where the set S approximately follows the Laplace distribution with scale ΔQ/ε and mean zero, as shown in formula (1):

Pr[M(Q, D) = S] ∝ exp(−(ε/ΔQ) · |S − Q(D)|₁)    (1)
Also, in their method, the probability density function of the added noise, which follows the Laplace distribution, is given by formula (2):

Pr(x, σ) = (1/(2σ)) · e^(−|x|/σ)    (2)

where σ = ΔQ/ε; that is, the added noise is independent of the data set and is only related to the function sensitivity and the privacy parameter. The main idea of their method is to add Laplace-distributed noise to the output result so as to distort the sensitive data and achieve the data protection target. For example, in their method, let Q(D) be the querying function of the top-k access counts; then the output of the algorithm M can be represented by formula (3):

M(Q, D) = Q(D) + (Lap1(ΔQ/ε), Lap2(ΔQ/ε), ..., Lapk(ΔQ/ε))    (3)

where Lapi(ΔQ/ε) (1 ≤ i ≤ k) is each round of independent noise following the Laplace distribution; the noise is proportional to ΔQ and inversely proportional to ε.

Definition 3.1 (ε-Differential Privacy): Given two adjacent data sets D and D′, where at most one data record differs between D and D′, for any algorithm M with range Range(M), if the result S output by the algorithm M satisfies formula (4) on the two adjacent data sets D and D′ (S ⊆ Range(M)), then the algorithm M satisfies ε-differential privacy:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]    (4)

Pr represents the randomness of the algorithm M on D and D′, namely the risk probability of privacy disclosure. ε represents the privacy protection level: if ε is larger, then the degree of privacy protection is lower; on the contrary, if ε is smaller, then the degree of privacy protection is higher.

Definition 3.2 (Data Sensitivity²): Data sensitivity is divided into global sensitivity and local sensitivity. Let Q be a query function; then the global sensitivity of the function Q is defined as follows:

ΔQ = max_{D,D′} |Q(D) − Q(D′)|₁    (5)

where D and D′ represent adjacent data sets, Q(D) represents the output of the function Q on the data set D, and ΔQ is the sensitivity, i.e., the maximum difference between the outputs.

Additionally, because an ε-differential privacy protection scheme may be used many times in the different stages of processing data, the ε-differential privacy protection scheme also needs to satisfy the following theorems:

Theorem 3.1: For the same data set, if the whole privacy protection process is divided into different privacy protection algorithms (M1, M2, ..., Mn) whose privacy protection levels are ε1, ε2, ..., εn, then the privacy protection level Σ_{i=1..n} εi of the whole process needs to satisfy differential privacy protection.

Theorem 3.2: For disjoint data sets, if the whole privacy protection process is divided into different privacy protection algorithms (M1, M2, ..., Mn) whose privacy protection levels are ε1, ε2, ..., εn, then the privacy protection level max{εi} of the whole process needs to satisfy differential privacy protection.

² Differential privacy protection adds noise to protect data: if the data sensitivity is small, then the data can be effectively protected while only a small quantity of noise is added to the original data; on the contrary, if the data sensitivity is large, then a lot of noise needs to be added to the original data.
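As a small numerical illustration of the Laplace mechanism in formulas (2) and (3), the sketch below adds independent Laplace noise of scale ΔQ/ε to each component of a query answer. The query, the sensitivity value and ε are illustrative assumptions; only the standard NumPy Laplace sampler is used.

```python
# Minimal sketch of the Laplace mechanism from formulas (2)-(3): each component
# of the query answer is perturbed with independent Laplace noise of scale
# delta_q / epsilon. Query values, sensitivity and epsilon are illustrative.
import numpy as np

def laplace_mechanism(query_answer, delta_q, epsilon, rng=None):
    """Return the noisy answer M(Q, D) = Q(D) + Lap(delta_q / epsilon) per component."""
    rng = np.random.default_rng() if rng is None else rng
    scale = delta_q / epsilon          # smaller epsilon -> larger noise -> stronger protection
    answer = np.asarray(query_answer, dtype=float)
    return answer + rng.laplace(loc=0.0, scale=scale, size=answer.shape)

# Example: a top-3 access-count query with sensitivity 1 (adding or removing one
# record changes each count by at most 1) and privacy level epsilon = 0.5.
true_counts = [120, 87, 53]
print(laplace_mechanism(true_counts, delta_q=1.0, epsilon=0.5))
```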
4 Trajectory data privacy protection scheme

In this section, we propose a trajectory data privacy protection scheme, which employs the Laplace differential privacy method to protect the user's trajectory data. In the proposed scheme, the algorithm first selects the protected points from the user's trajectory data; secondly, the algorithm builds polygons from the protected points and the adjacent, frequently accessed points selected from the accessed-point database, and then calculates the polygon centroids; finally, noise is added to the polygon centroids by the Laplace differential privacy method, the noisy polygon centroids are used to replace the protected points, and the algorithm constructs and issues the new trajectory data. The procedure of the proposed scheme is described as follows:

(1) Input the trajectory data I, the related and historic point data set D³, the radius r and the differential privacy protection parameters ε and min_count⁴;

(2) Select the protected point set A from the trajectory data I, then select each point f ∈ A and its corresponding adjacent points from D, where the adjacent points lie within the circle whose center is f and whose radius is r, and the access counts of the adjacent points are no less than min_count; these points form the point set B;

(3) Traverse the set B and build the corresponding polygons from each point f and its corresponding adjacent points in B, where only one point in every polygon belongs to the trajectory data I; then calculate the corresponding polygon centroids and form the polygon centroid set J, where ji(x, y) ∈ J is a polygon centroid (see Section 4.2 for more details);

(4) Use the Laplace mechanism to add the noise Lap(k·ΔQ/ε) to the set J, where the noise is added to the polygon centroids, and then generate the set G (see Section 4.3 for more details);

(5) Use the modified polygon centroids from G to replace the corresponding protected points f ∈ A, and then issue the new trajectory data I′.

³The related and historic point data include the historic location points accessed by people and the corresponding access counts. For the trajectory data, we may save the historic trajectory data and the related information (including access time and access count) to the database, and the data may then be classified to statistically form the set D.
⁴Our proposed scheme focuses on frequently accessed location data so as to distort the attacker's target. The setting of min_count therefore improves the efficiency of the proposed scheme.

4.1 Processing trajectory data

This section describes how to select the related data from the trajectory data I and the related and historic point data set D. The proposed algorithm selects the protected points f ∈ A and their adjacent points from D. Figure 1 shows the procedure of selecting the related data (Figure 1: Processing Trajectory Data). In Figure 1, a random trajectory of one user is shown, where the red circles and the red arrows depict the trajectory, and the green circles denote the accessed historic location points⁵, which build the related and historic point data set D⁶. According to Figure 1, the procedure of selecting the related data may be described as follows (a code sketch of the adjacent-point selection is given after this list):

– The proposed algorithm inputs the trajectory data I of one user, the related and historic point data set D and the related privacy protection parameters r, ε and min_count;

– The algorithm selects the protected point set A from the trajectory data I;

– The proposed algorithm forms the point set B from each point fi ∈ A and its corresponding adjacent points from D, where the adjacent points lie within the circle whose center is f and whose radius is r, and the access counts of the adjacent points are no less than min_count.
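The following small C++ sketch (our own illustration, not the authors' code) shows the adjacent-point selection just described: for a protected point f, it keeps the historic points of D that lie within radius r of f and whose access count is at least min_count. The Point structure and the Euclidean distance test are assumptions made for illustration.

#include <cmath>
#include <vector>

struct Point {
    double x, y;
    int access_count;   // how often this historic location was visited
};

// Collect the points of D that are within radius r of the protected point f
// and whose access count is at least min_count (the candidate set for one polygon).
std::vector<Point> adjacent_points(const Point &f, const std::vector<Point> &D,
                                   double r, int min_count) {
    std::vector<Point> B;
    for (const Point &h : D) {
        double dx = h.x - f.x, dy = h.y - f.y;
        if (std::sqrt(dx * dx + dy * dy) <= r && h.access_count >= min_count)
            B.push_back(h);
    }
    return B;
}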
4.2 Building the polygon model

This section describes how to build the polygon model and compute the polygon centroids. The proposed algorithm builds the polygons from the protected points f ∈ A and the corresponding adjacent points from D. Figure 2 shows the procedure of building a polygon (Figure 2: Building Polygon Model). In Figure 2, the trajectory of one user is f1, f2, ..., f5 ∈ I, and the points h1, h2, ..., h13 with their access counts come from D, where f2, f4 ∈ A are the protected points.

⁵The adjacent point data may be related to other users.
⁶The historic duration is within one month.

In the green circle whose center is f2 and whose radius is r, the points h1, h2 and h4 (∈ D, with access counts ≥ 50) and the point f2 are used to form a polygon. Then the proposed algorithm computes the polygon centroid j1 (noise is added to j1 to generate a new point g1). Similarly, the algorithm traverses the set B to build the remaining polygons. We remark that the points h1, h2 and h4 are near the point f2, so using them to build the polygon maintains the usability of the modified trajectory, and that since min_count is set to 50, points whose access counts are less than 50 are not used to build the polygon in the green circle; this distorts the attacker's target and improves the efficiency of the proposed scheme. The procedure of building the polygon model may be described as follows:

– The algorithm traverses the set B, and then selects the relevant points, as many as possible, to build the polygons according to the distance. For example, for a potential polygon, the algorithm selects N points as vertices from B with coordinates P(xi, yi), i = 1, 2, 3, ..., N, where one of the N points lies on the original trajectory and the other points are near it;

– The algorithm computes the polygon centroids from the vertices of the formed polygons. The formulas are as follows:

ji.x = ( Σ from k=1 to |Pi| of Pi.xk ) / |Pi| ,   ji.y = ( Σ from k=1 to |Pi| of Pi.yk ) / |Pi|

where Pi(xk, yk) is the coordinate of the k-th vertex of the i-th polygon, |Pi| is the number of vertices of the i-th polygon, and ji(x, y) is the coordinate of the i-th polygon centroid;

– The polygon centroids form the set J, where ji(x, y) ∈ J.

4.3 Adding noise based on the Laplace mechanism

In this section, we show how to use the Laplace mechanism to add the noise Lap(k·ΔQ/ε)⁷ to the set J. The main steps of the algorithm are described as follows (a short code sketch of these two steps is given after this list):

– Input the privacy protection level ε and the polygon centroid set J, and then generate the noise Lap(k·ΔQ/ε) according to the probability density Pr(j(x, y), λ), where

Pr(j(x, y), λ) = (1 / (2·λ)) · exp( −|j(x, y)| / λ )

In the above formula, the variable j(x, y) denotes the corresponding coordinate of the polygon centroid and λ = k·ΔQ/ε.

– Add the noise Lap(k·ΔQ/ε) to the set J so as to perturb the polygon centroids⁸:

ji.x = ji.x ± Lap(k·ΔQ/ε),   ji.y = ji.y ± Lap(k·ΔQ/ε),

where ji ∈ J, ji(x, y) denotes the coordinate of the i-th polygon centroid, and Lap(k·ΔQ/ε) is an independent noise sample in each round, drawn according to the probability Pr(j(x, y), λ). Finally, the algorithm generates the set G.

– Use the modified polygon centroids from G to replace the corresponding protected points f ∈ A, and then issue the new trajectory data I′.
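As an illustration of Sections 4.2 and 4.3, here is a minimal, self-contained C++ sketch (an assumption-laden reading of the steps, not the authors' implementation) that averages polygon vertices to get a centroid and then perturbs it with Laplace noise of scale k·ΔQ/ε. The sign choice of the perturbation (footnote 8) is simplified to symmetric noise here.

#include <cmath>
#include <random>
#include <vector>

struct Pt { double x, y; };

// Centroid of a polygon taken as the arithmetic mean of its vertices,
// matching the centroid formulas of Section 4.2.
Pt centroid(const std::vector<Pt> &vertices) {
    Pt c{0.0, 0.0};
    for (const Pt &p : vertices) { c.x += p.x; c.y += p.y; }
    c.x /= vertices.size();
    c.y /= vertices.size();
    return c;
}

// Laplace sample with scale lambda (inverse-transform sampling).
double lap(double lambda, std::mt19937 &rng) {
    std::uniform_real_distribution<double> uni(-0.5, 0.5);
    double u = uni(rng);
    return -lambda * ((u < 0) ? -1.0 : 1.0) * std::log(1.0 - 2.0 * std::fabs(u));
}

// Perturb a centroid with Laplace noise of scale k*deltaQ/epsilon (Section 4.3).
// The paper adds or subtracts the noise depending on which side of the trajectory
// the polygon lies (footnote 8); symmetric noise is used here for brevity.
Pt perturb_centroid(const Pt &j, double k, double deltaQ, double epsilon,
                    std::mt19937 &rng) {
    double lambda = k * deltaQ / epsilon;
    return Pt{ j.x + lap(lambda, rng), j.y + lap(lambda, rng) };
}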
For example, as shown in Figure 2, the noise is added to j1 to generate a new point g1, and then g1 is used to replace the point f2; thus the original trajectory f1 → f2 → f3 changes to f1 → g1 → f3.

5 Experiment and efficiency analysis of the proposed scheme

In this section, our experiments evaluate the efficiency of the proposed scheme from two aspects: the first is the running time of the proposed algorithms, namely the time of extracting the available data; the second is the effectiveness of the proposed algorithms, whose indexes include the trajectory deviation rate and the trajectory accuracy rate. The test data set comes from a simulation on the Baidu map⁹, which is similar to the Gowalla data set¹⁰. The test data set contains user_id, access time, longitude, latitude and so on. The period of the test data set is about one month. All proposed algorithms are coded in C++ with Code::Blocks¹¹. The related parameters for the tests are set as in Table 1.

⁷ΔQ is the sensitivity of the query function Q, where we set ΔQ = max of sqrt( (Pi.xk − ji.x)² + (Pi.yk − ji.y)² ) over i = 1, 2, ..., |NP| and k = 1, 2, ..., |Pi|; |NP| is the number of polygons and |Pi| is the number of vertices of each polygon.
⁸If the formed polygon is on the left of the protected point of the trajectory data I, the operation "+" is used; otherwise, if the formed polygon is on the right of the protected point of the trajectory data I, the operation "−" is used.
⁹Baidu is a network company in China. The Baidu map is one of the network services provided by the company, which offers many APIs for programmers to develop applications on the map.

Table 1: Parameter values (unit: 5 meters)
Parameter | Values
r         | 40, 50, 60, 70, 80, 90, 100, 110
ε         | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

5.1 Running time analysis

In this section, we test the running time of the proposed algorithms, mainly through the time of extracting the available data; namely, we test the efficiency of computing all the polygon centroids from the available data. In the tests, when we set r = 70 and ε = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 respectively, the time of extracting the available data is given in Table 2. From Table 2, we can see that the time of extracting the available data is very short, and that the efficiency of computing all the polygon centroids from the available data keeps increasing with increasing ε within a certain range.

5.2 Protection effectiveness analysis

In this section, we test the protection effectiveness of the proposed algorithms mainly through the trajectory deviation rate and the trajectory accuracy rate. The trajectory deviation rate is the angle θ formed by the modified polygon centroid and the original trajectory points, as shown in Figure 3; if the trajectory deviation rate is larger within a certain range, the protection effectiveness is higher. The trajectory accuracy rate is used to test the protection effectiveness and the usability of the noise-added data; if the trajectory accuracy rate is smaller within a certain range, the usability is higher. In the test, we compute the trajectory accuracy rate through the following method: 1) take the coordinate (ai, bi) of the polygon centroid; 2) compute the hypotenuse ci = sqrt(ai² + bi²); 3) compute the accuracy rate Z = | 1 − c′i / ci |, where ci is the original hypotenuse and c′i is the noise-added hypotenuse.
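A short hedged C++ sketch of the accuracy-rate metric just described (the centroid coordinates below are illustrative values, not data from the experiments):

#include <cmath>
#include <cstdio>

int main() {
    // Accuracy rate Z = |1 - c'/c|, where c = sqrt(a^2 + b^2) is the original
    // hypotenuse of a centroid (a, b) and c' is the noise-added one.
    double c  = std::sqrt(400.0 * 400.0 + 506.0 * 506.0);   // original centroid (hypothetical)
    double cp = std::sqrt(380.0 * 380.0 + 481.0 * 481.0);   // noise-added centroid (hypothetical)
    double Z  = std::fabs(1.0 - cp / c);
    std::printf("c=%.3f c'=%.3f Z=%.6f\n", c, cp, Z);
    return 0;
}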
The larger the trajectory deviation rate is within a certain range, the higher the protection effectiveness; the smaller the trajectory accuracy rate is within a certain range, the higher the usability. So, when we set ε = 5, 10, 15 and r = 40, 50, 60, 70, 80, 90, 100, 110 respectively, Tables 3, 4 and 5 show the deviation rate and accuracy rate of the trajectory data.

¹⁰Gowalla is a location-based social networking website where users share their locations by checking in.
¹¹The test environment is Windows 10, an Intel i5 CPU at 2.3 GHz and 8 GB RAM.

(Figure 3: Trajectory Deviation Angle)

From Table 3, when ε = 5 and r < 90, we can see that the polygon centroid does not change as r increases, so the deviation rate θ and the accuracy rate Z do not change either. This shows that in the range r < 90, no new points are selected to build a new polygon, so the polygon is not modified. When r ≥ 90, new points are selected to build a new polygon, so the polygon centroid is recomputed and the deviation rate θ and the accuracy rate Z change. This shows that the deviation rate θ can become larger as r increases, while the data usability becomes smaller. Also, from Table 4 and Table 5, when ε = 10, 15, we obtain results similar to those of Table 3. Additionally, when we fix r = 70 and set ε = 1, 2, 3, 4, ..., 15 respectively, Table 6 shows the deviation rate and accuracy rate of the trajectory data. From Table 6, we can see that the deviation rate θ and the accuracy rate Z always increase with increasing ε. That is because the constraint condition becomes weaker as ε increases in the differential privacy mechanism. However, this also shows that as the deviation rate θ becomes larger, the data usability becomes smaller.

Table 2: The efficiency of extracting the available data
ε         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Time (ms) | 4 | 4 | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 3  | 3  | 2

Table 3: Trajectory deviation rate and accuracy rate (ε = 5)
r    | ci      | c′i     | Z        | θ
40   | 645.264 | 613.125 | 0.049807 | 23.2510
50   | 645.264 | 613.125 | 0.049807 | 23.2510
60   | 645.264 | 613.125 | 0.049807 | 23.2510
70   | 645.264 | 613.125 | 0.049807 | 23.2510
80   | 645.264 | 613.125 | 0.049807 | 23.2510
90   | 608.511 | 572.839 | 0.058621 | 24.7920
100  | 608.511 | 572.839 | 0.058621 | 24.7920
110  | 608.511 | 572.839 | 0.058621 | 24.7920

Table 4: Trajectory deviation rate and accuracy rate (ε = 10)
r    | ci      | c′i     | Z        | θ
40   | 645.264 | 613.096 | 0.049852 | 23.2532
50   | 645.264 | 613.096 | 0.049852 | 23.2532
60   | 645.264 | 613.096 | 0.049852 | 23.2532
70   | 645.264 | 613.096 | 0.049852 | 23.2532
80   | 645.264 | 613.096 | 0.049852 | 23.2532
90   | 608.511 | 572.809 | 0.05867  | 24.7941
100  | 608.511 | 572.809 | 0.05867  | 24.7941
110  | 608.511 | 572.809 | 0.05867  | 24.7941
Table 5: Trajectory deviation rate and accuracy rate (ε = 15)
r    | ci      | c′i     | Z        | θ
40   | 645.264 | 612.964 | 0.050057 | 23.2584
50   | 645.264 | 612.964 | 0.050057 | 23.2584
60   | 645.264 | 612.964 | 0.050057 | 23.2584
70   | 645.264 | 612.964 | 0.050057 | 23.2584
80   | 645.264 | 612.964 | 0.050057 | 23.2584
90   | 608.511 | 572.665 | 0.058908 | 24.7996
100  | 608.511 | 572.665 | 0.058908 | 24.7996
110  | 608.511 | 572.665 | 0.058908 | 24.7996

Table 6: Trajectory deviation rate and accuracy rate (r = 70)
ε   | ci      | c′i     | Z        | θ
1   | 645.264 | 613.126 | 0.049806 | 23.25090
2   | 645.264 | 613.126 | 0.049806 | 23.25090
3   | 645.264 | 613.126 | 0.049806 | 23.25090
4   | 645.264 | 613.126 | 0.049806 | 23.25090
5   | 645.264 | 613.125 | 0.049807 | 23.2510
6   | 645.264 | 613.125 | 0.049807 | 23.2510
7   | 645.264 | 613.122 | 0.049812 | 23.2514
8   | 645.264 | 613.117 | 0.049819 | 23.2518
9   | 645.264 | 613.109 | 0.049833 | 23.2524
10  | 645.264 | 613.096 | 0.049852 | 23.2532
11  | 645.264 | 613.079 | 0.049879 | 23.2541
12  | 645.264 | 613.057 | 0.049913 | 23.2551
13  | 645.264 | 613.030 | 0.049954 | 23.2562
14  | 645.264 | 612.999 | 0.050003 | 23.2573
15  | 645.264 | 612.964 | 0.050057 | 23.2584

6 Conclusions

Currently, because the records of trajectory data are discrete in the database, some existing privacy protection schemes have difficulty protecting trajectory data. In this paper, we propose a trajectory data privacy protection scheme based on the Laplace differential privacy mechanism. In the proposed scheme, the algorithm first selects the protected points from the user's trajectory data; secondly, the algorithm builds the polygons according to the protected points and the adjacent, frequently accessed points selected from the accessed-point database, then the algorithm calculates the polygon centroids; finally, noise is added to the polygon centroids by the differential privacy method, the new polygon centroids are used to replace the protected points, and the algorithm constructs and issues the new trajectory data. The experiments show that the running time of the proposed algorithms is short, the privacy protection of the scheme is effective and the data usability of the scheme is high.

Acknowledgement

This study was funded by the National Natural Science Foundation of China (No. 61402055, No. 61504013) and the Natural Science Foundation of Hunan Province (No. 2016JJ3012).

References

[1] Ouyang Jia, Yin Jian, Liu Shaopeng, Liu Yuba. An Effective Differential Privacy Transaction Data Publication Strategy. Journal of Computer Research and Development, 2014, 51(10):2195-2205. https://doi.org/10.7544/issn1000-1239.2014.20130824
[2] Loki. Available at: http://loki.com/.
[3] FireEagle. Available at: http://info.yahoo.com/privacy/us/yahoo/freeagle/.
[4] Google Latitude. Available at: http://www.google.com/latitude/apps/badge.
[5] Samarati P, Sweeney L. Generalizing data to provide anonymity when disclosing information. Proc. of the 7th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1998, 188-202. https://doi.org/10.1145/275487.275508
[6] Samarati P. Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010-1027. https://doi.org/10.1109/69.971193
[7] Fung BC, Wang K, Yu PS. Anonymizing classification data for privacy preservation. IEEE Trans. on Knowledge and Data Engineering (TKDE), 2007, 19(5):711-725. https://doi.org/10.1109/tkde.2007.1015
[8] Sweeney L. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5):571-588. https://doi.org/10.1142/S021848850200165X
[9] Xiao X, Tao Y. Anatomy: Simple and effective privacy preservation. Proc. of VLDB 2006. New York: ACM, 2006, 139-150.
[10] Zhang Q, Koudas N, Srivastava D, et al. Aggregate query answering on anonymized tables. Proc. of the 23rd Int'l Conf. on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2007, 116-125. https://doi.org/10.1109/icde.2007.367857
[11] Wong RCW, Li J, Fu AWC, Wang K. (a, k)-Anonymity: An enhanced k-anonymity model for privacy-preserving data publishing. Proc. of the 12th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2006, 754-759. https://doi.org/10.1145/1150402.1150499
[12] LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian multidimensional k-anonymity. Proc. of the 22nd Int'l Conf. on Data Engineering, 2006, 6(3):25-35. https://doi.org/10.1109/ICDE.2006.101
[13] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. L-Diversity: Privacy beyond k-anonymity. Proc. of the 22nd IEEE Int'l Conf. on Data Engineering, 2006.
https://doi.org/10. 1109/ICDE.2006.1 [14] Xiao X, Tao Y. Personalized privacy preservation. Proc.of the 2006ACM SIGMOD Int’lConf. on Man­agement of Data, 2006, 229-240. https://doi. org/10.1145/1142473.1142500 [15] Xu J, Wang W, Pei J, Wang X, Shi B, Fu AWC. Utility-Based anonymization using local recoding. Proc.of the 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, 2006, 785­790. https://doi.org/10.1145/1150402. 1150504 [16] LiN,LiT,VenkatasubramanianS. t-closeness: Pri­vacybeyond k-anonymity and l-diversity. Proc.of the 23rd IEEE Int’l Conf. on Data Engineering, 2007, 106-115. https://doi.org/10.1109/icde. 2007.367856 [17]Wong RCW,FuAWC,WangK, PeiJ. Minimality at­tack in privacypreserving data publishing. Proc.of the 33nd Int’lConf. onVery Large Databases, 2007, 543­554. [18] Tao Y, Xiao X, Li J, Zhang D. On anti-corruption privacypreserving publication. Proc.of the 24th Int’l Conf. on Data Engineering, 2008, 725-734. https: //doi.org/10.1109/ICDE.2008.4497481 [19] Backstrom L, Dwork C, Kleinberg J. Wherefore are thouR3579X?: Anonymized social networks, hid-dern patterns and structural steganography. Proc.of the 16th Int’lConf. onWorldWideWeb, 2007, 181­190. https://doi.org/10.1145/1242572. 1242598 [20] Zheleva E, Getoor L. Preserving the privacyof sensi­tive relationships in graph data. Proc.of the 1stACM SIGKDD Workshop on Privacy, Security, and Trust in KDD, 2007, 153-171. https://doi.org/10. 1007/978-3-540-78478-4\_9 [21]KorolovaA,MotwaniR, NabarSU,XuY.Linkpri­vacyin social networks. Proc. of the 24th Int’lConf. on Data Engineering, 2008, 1355-1357. https:// doi.org/10.1109/icde.2008.4497554 [22] Cristofaro E, Soriente C, Tsudik G, et al. Humming­bird: Privacy at the time of twitter. IEEE Sympo­sium on Security and Privacy -S&P, 2012, 285-299. https://doi.org/10.1109/sp.2012.26 [23] Beresford AR, Rice A, Skehin N, Sohan R. Mock-Droid: Trading privacy for application functional­ity on smartphones. Proc. of the 12th Workshop on Mobile Computing Systems and Applications, ACM Press, 2011, 49-54. https://doi.org/ 10.1145/2184489.2184500 [24] Huo Z, Meng XF. A survey of trajectory privacy-preserving techniques. Chinese Journal of Com­puters, 2011, 34(10):1820-1830. https://doi. org/10.3724/sp.j.1016.2011.01820 [25] BambaB, LiuL, PestiP,WangT. Supporting anony­mous location queries in mobile environments with privacy grid. Proc. of the 17th Int’l Conf. onWorld Wide Web. New York: ACM Press, 2008, 237­246. https://doi.org/10.1145/1367497. 1367531 [26] Liu L. From data privacy to location privacy: Mod­els and algorithms. Proc. of the 33rd Int’l Conf. on VeryLargeData Bases.NewYork:ACMPress,2007, 1429-1430. [27] LiuF, Hua KA, CaiY. Query l-diversity in location-based services. Proc. of the 10th Int’lConf. on Mobile Data Management.Taipei, 2009, 436-442. https: //doi.org/10.1109/mdm.2009.72 [28] Ting W, Ling L. From Data Privacy to Location Privacy. Machine Learning in Cyber Trust, Berlin: Springer, 14 March 2009, pp.217-246. https:// doi.org/10.1007/978-0-387-88735-7_9 [29] Bugra G, Ling L. Protecting Location Privacy. IEEE Transactions on Mobile Computing, 2008, 7(1):1­18. https://doi.org/10.1109/TMC.2007. 1062 [30] He X, Cormode G, Machanavajjhala A, Procopiuc CM, Srivastava D. Differentially Private Trajectory Synthesis Using Hierarchical Reference Systems. VLDB Journal, 2015, 8(11):1154-1165. https:// doi.org/10.14778/2809974.2809978 [31] Chatzikokolakis K, Palamidessi C, Stronati M. A Predictive Differentially-Private Mechanism for Location Privacy. Proc. 
of the 14th Interna­tional Symposium on Privacy Enhancing Tech­nologies, Berlin: Springer, 2014, LNCS 8555, pp.21-41. https://doi.org/10.1007/ 978-3-319-08506-7\_2 [32] ChatzikokolakisK,PalamidessiC, StronatiM. Geo-indistinguishability:APrincipled Approach to Loca­tion Privacy. ICDCIT 2015, Berlin: Springer, 2015, LNCS 8956, pp.49-72. https://doi.org/10. 1007/978-3-319-14977-6_4 [33] Li YD, Zhang Z, Winslett M, et al. Compressive mechanism: Utilizing sparse representation in dif­ferential privacy. Proc. of the 10th Annual ACM Workshop on Privacyin the Electronic Society. New York: ACM, 2011, pp.177-182. https://doi. org/10.1145/2046556.2046581 [34] Zhang XJ, Wang M, and Meng XF. An Ac­curate Method for Mining top-k Frequent Pat­tern Under Differential Privacy. Journal of Computer Research and Development, 2014, 51(1):104-114. https://doi.org/10.7544/ issn1000-1239.2014.20130685 [35] Dwork C. The promise of differential privacy: A tu­torial on algorithmic techniques. Proc. of the Foun­dations of Computer Science (FOCS). Piscataway, NJ: IEEE, 2011, pp.1-2. https://doi.org/10. 1109/focs.2011.88 [36] Dwork C. A frm foundation for private data analy­sis. Communications of the ACM, 2011, 54(1):86­95. https://doi.org/10.1145/1866739. 1866758 [37] McsherryF,Talwar K. Mechanism design via differ­ential privacy. Proc. of the 48th Annual IEEE Symp. onFoundationsof Computer Science (FOCS), Piscat-away, NJ: IEEE, 2007, pp.94-103. https://doi. org/10.1109/focs.2007.4389483 [38] DWork C, McSherry F, Smith A. Calibrating noise to sensitivity in private data analysis. Proc. of the 3th Theory of Cryptography Conf (TCC06), Berlin: Springer, 2006, pp.265-284. https://doi.org/ 10.1007/11681878_14 https://doi.org/10.31449/inf.v42i3.1497 Informatica 42 (2018) 417–438 417 A Hybrid Particle Swarm Optimization and Differential Evolution Based Test Data Generation Algorithm for Data-Flow Coverage Using Neighbourhood Search Strategy Sapna Varshney and Monica Mehrotra Department of Computer Science, Jamia Millia Islamia, India E-mail: sapna_varsh@yahoo.com, drmehrotra2000@gmail.com Keywords: search based software testing, particle swarm optimization, differential evolution, data flow testing, dominance tree Received: January 15, 2017 Meta-heuristic search techniques, mainly Genetic Algorithm (GA), have been widely applied for automated test data generation according to a structural test adequacy criterion. However, it remains a challenging task for more robust adequacy criterion such as data-flow coverage of a program. Now, focus is on the use of other highly-adaptive meta-heuristic search techniques such as Particle Swarm Optimization (PSO) and Differential Evolution (DE). In this paper, a hybrid (adaptive PSO and DE) algorithm is proposed to generate test data for data-flow dependencies of a program with a neighbourhood search strategy to improve the search capability of the hybrid algorithm. The fitness function is based on the concepts of dominance relations and branch distance. The measures considered are mean number of generations and mean percentage coverage. The performance of the hybrid algorithm is compared with that of DE, PSO, GA, and random search. Over several experiments on a set of benchmark programs, it is shown that the hybrid algorithm performed significantly better than DE, PSO, GA and random search in data-flow test data generation with respect to the measures collected. 
Povzetek: A new algorithm is developed as a combination of hybrid particle swarm optimization and differential evolution using a neighbourhood search strategy.

1 Introduction

Software testing aims at assessing the quality and reliability of a software product by detecting as many defects as possible. The cost of software testing increases exponentially with the size of the input search space, thereby making manual testing a difficult and tedious task. There are software testing tools available with capture and playback features to automate the execution of test scripts. However, the test cases are manually selected by the human tester and may not be optimal. It is therefore desirable to generate optimal test data that reveals as many errors as possible according to a test adequacy criterion [1]. Structural (white-box) testing tests software for its structure and has the inherent capability to expose faults. The structural test adequacy criteria can be statement coverage, branch coverage, or path coverage, which aim at executing every statement, branch or path respectively at least once. Data-flow coverage, an effective and robust test adequacy criterion, focuses on the definition and usage of variables in a program. Data-flow testing, therefore, could lead to more efficient and targeted test suites.

The attempts to reduce the cost of software testing by automating the process of software test data generation have been constrained by the ever increasing size and complexity of software. In the early period of automated test data generation, gradient descent and meta-heuristic search (MHS) algorithms such as Tabu Search, Hill Climbing and Simulated Annealing were applied [2, 3, 4]. In the past two decades, evolutionary search-based algorithms such as Genetic Algorithm (GA) have been widely employed for test data generation as an effective alternative [5, 6, 7, 8, 9]. A search-based approach captures the test adequacy criteria as a fitness function that is used to guide the search. Due to an extensive application of search-based algorithms to the test data generation problem, the approach has come to be known as Search Based Software Testing (SBST, coined by Harman and Jones). Recently, the focus is on the use of other highly adaptive search-based techniques such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO) and Differential Evolution (DE). It has been observed that GA and ACO have slow convergence towards the optimal solution. PSO and DE are conceptually very simple, and the knowledge of previous good solutions is retained by all the members of the current population by means of constructive cooperation among them. PSO and DE have been found to be robust in solving optimization problems; however, the performance depends on control parameters. PSO has been shown to be well suited for test data generation with better performance than GA [10, 11, 12, 13, 14]. Hybridization of search-based algorithms for test data generation has also been reported in the literature. GA with a local search algorithm [15] and, more recently, GA with PSO has been applied for test data generation in some studies [16, 17, 18, 19, 20, 21].

In this study, we propose a hybrid global search algorithm by combining an adaptive PSO with the DE mutation operator to automatically generate test data for data-flow dependencies of a program. In the proposed hybrid algorithm, a new term based on the DE differential operator is included in the PSO velocity update for some additional exploration capability.
The greedy selection scheme of DE is used wherein position of a particle is updated only if it yields a better fitness value. This results in movement of particles only to better locations in the input search space. A local neighborhood strategy is also included in the proposed hybrid algorithm to explore more promising candidate solutions and overcome the problem of boundary constraints. Design of the fitness function [22] is based on dominance concepts and branch distance that is used to guide the search for optimal test data for data-flow dependencies of a program. The performance of the proposed hybrid algorithm is compared with that of DE, PSO, GA and random search. It is demonstrated that the proposed hybrid algorithm outperformed DE, PSO, GA and random search in terms of mean percentage coverage achieved, and mean number of generations to produce the final test suite for data-flow coverage of a program. The rest of the paper is organized as follows: Section 2 provides a brief description of automated software test data generation process and related work. Section 3 provides an overview of data-flow analysis. Sections 4 and 5 provide a brief description of PSO and DE algorithms. Section 6 describes the proposed hybrid algorithm. Section 7 gives the experimental results. Section 8 provides the discussion and the detailed statistical analysis of the experimental results. Section 9 deals with threats to validity and limitations of the proposed hybrid algorithm. Finally, section 10 gives the conclusion. 2 Related work This section presents the methods to generate test data for software structural testing and the related literature. Symbolic execution, a static method, has been employed for test data generation [2]; however, the performance is constrained by programming constructs such as pointers, loop conditions with input variables, array subscripts and procedure calls [23]. Dynamic methods that have been employed for test data generation can be classified as random, path-oriented and goal-oriented techniques [9, 23]. A random test data generator arbitrarily selects test data from the input domain. Though easy to implement, it may fail to find optimal test data. Path-oriented test data generator [5] uses control flow information to identify a set of independent paths to generate test data. However, it does not work well with infeasible paths or paths that contain loops. A goal-oriented test data generator [9, 23, 24] generates test data for a selected goal such as a statement or a branch, irrespective of the path taken. S. Varshney et al. The meta-heuristic search techniques guided by a fitness function have been adopted to generate optimal test data mainly according to a structural test adequacy criterion. From the literature on structural test data generation, it can be inferred that branch coverage and path coverage are the most often used and well-understood measures [25]. For branch coverage, fitness values are calculated by finding approximation level and branch distance for a target branch from control flow graph [8, 26]. Data-flow coverage criterion has not been used much [27] due to difficulty in writing test cases that satisfy data-flow dependencies of a program. Wegener et al. [28] defined different types of fitness functions for structural testing; data-flow test criteria being classified as node-node-oriented methods. 
Recently only there has been more work on search based test data generation for data-flow coverage using GA as the algorithm of choice [6, 7, 22, 24, 29, 30]. Now, other highly adaptive search-based techniques such as PSO [14, 18] and ACO [31] are also being applied to generate test data for data-flow coverage due to simplicity and faster convergence. ACO [32] and Harmony Search [33] has also been applied to generate structural test data for branch coverage. Vivanti et al. [30] have proposed a GA-based technique for data-flow coverage evaluated on open source Java applications. The results have indicated the scalability and applicability of data-flow criteria for test data generation. In our previous work [22], an elitist GA-based approach is proposed to generate test data for data-flow dependencies of a program using dominance concepts and branch distance. The fitness function is derived from the work by Ghiduk et al. [6]; it is augmented with branch distance to produce a smoother landscape for guiding the search and also takes into account that a definition may be killed by another definition before the associated use is reached. The performance of the proposed approach is compared with random search and earlier studies on test data generation for data-flow dependencies of a program by Girgis [7], Ghiduk et al. [6] and Girgis et al. [21]. The proposed GA-based approach guided by the novel fitness function outperformed random search and the earlier studies [6, 7, 21] to generate test data for data-flow coverage of a program. Windisch et al. [10] applied PSO to artificial and complex industrial test objects to generate test data for branch coverage. Their results showed efficiency and efficacy of PSO over GA for most code elements to be covered. Agarwal et al. [11] applied binary PSO, Agarwal and Srivastava [12] applied discrete quantum PSO and Mao [13] applied standard PSO to generate test data for branch coverage test adequacy criterion. Nayak and Mohapatra [14] proposed an algorithm to generate test cases using PSO for data flow coverage. This technique cannot rank test cases because the .tness function, as simply taken from Girgis [7], assigns the same fitness value to all the test cases that cover the same number of test requirements and a fitness value of 0 to all the test cases that do not cover any test requirement or cover a partial aim. Here, the fitness function is unable to guide the search. Application of hybrid algorithms have also been studied for test data generation problem. Zhang et al. [16] proposed a hybrid algorithm (GA and PSO) to generate test data for path coverage. GA and PSO operations are applied to two population sets. Triangle classification problem is taken as the case study and the hybrid algorithm is compared with GA and PSO. The hybrid algorithm is shown to be better than GA and PSO with respect to number of iterations. The average time taken is found to be more than PSO but less than GA. Their hybrid technique is complicated and may generate redundant test cases for automatic test data generation. Li et al. [17] also proposed a hybrid algorithm (GA and PSO) to generate test data for path coverage. PSO equations to update particle’s velocity and position distance are used instead of mutation operator of GA. The algorithm is applied only to the triangle benchmark problem. Singla et al. [18] applied a hybrid algorithm (GA and PSO) to generate test data for data-flow coverage. 
The fitness function used is same as in [6]; it does not take into account the traversal of killing nodes as well as closeness of test data in case if only partial aim is covered. The strategy is tested only on some simple programs. Kaur and Bhatt [19] proposed a hybrid algorithm (GA and PSO) to prioritize test data in regression testing. The algorithm has been tested on few simple programs. Girgis et al. [21] proposed a hybrid Genetical Swarm Optimization (GSO) Technique to generate a set of test paths that cover the all-uses criterion for data-flow coverage. The authors have claimed that the set of paths generated by the proposed GSO can be passed to a test data generation tool to find program inputs that will execute them to complete the data flow paths testing of the program under test. The fitness function used is same as in [7]; it is not able to guide the search and results in loss of valuable information in case if only partial aim is covered. Chawla et al. [20] proposed a hybrid PSO and GA algorithm for automatic generation of test suites with branch coverage as the test adequacy criterion. The experiments are performed with ten Java container classes. The algorithm is shown to perform better than GA, PSO and existing hybrid strategies based on GA and PSO. Each optimization algorithm has its own advantages and disadvantages. Also, one optimization algorithm will not work well for all the optimization problems. DE, a meta-heuristic search-based algorithm, has been applied to several optimization problems [34, 35] to demonstrate its potential. Das et al. [36] has explored hybridization of PSO with DE applied to the design of digital filters. However, DE has not been applied for test data generation and optimization problem [25, 27, 37]. The proposed study will focus on the application of a hybrid adaptive PSO-DE algorithm to generate test data for data-flow dependencies of a program. The proposed hybrid global search algorithm combines the evolution Informatica 42 (2018) 417–438 419 scheme of both PSO and DE incorporating the best of both the algorithms in the context of test data generation. A new term based on DE differential operator is included for velocity update in PSO. The greedy selection scheme of DE is also used wherein position of a member is updated only if it yields a better fitness value. The hybridization scheme has resulted in movement of particles only to better locations in the input search space. The design of fitness function [22] is based on the dominance relations between the nodes of a program’s control flow graph augmented with branch distance which produces a smoother landscape for guiding the search. This leads to faster and better convergence of test data to achieve the desired coverage. A neighborhood search strategy is also incorporated into the proposed hybrid algorithm that further helps in overcoming the problem of boundary constraints and local optima by exploring more promising candidate solutions. This is the main contribution of this paper. The proposed hybrid algorithm generates test data for one test requirement at a time; other test requirements are also checked for coverage thereby reducing the overall number of fitness evaluations. 3 Data flow analysis In this study, data-flow coverage is used as the test adequacy criteria. Data-flow analysis [38] augments the control-flow testing criteria; the emphasis is on the definition and use of the variables in a program. 
The control flow of a program is represented by a directed graph G(V, E), also known as a control flow graph (CFG), where V is the set of all the nodes and E is the set of all the edges in the graph. Each node corresponds to a program statement or a group of sequential program statements, and an edge represents the flow of control from one node to another. There are two distinct nodes: an entry node n0 and an exit node nend. Node n dominates node m (dominance relationship) if every path from the entry node n0 to m contains n. By applying the dominance relationship to all the nodes of the CFG, a tree rooted at n0 can be obtained. This tree is called the dominator tree [39]. For each node m in the CFG, Dom(m) is the set of all the nodes that dominate node m. Figure 2 gives the CFG of the example program of Figure 1, and the dominator tree is shown in Figure 3. For example, Dom(12) = {1, 2, 6, 7, 12}.

In a program, the definition and use occurrences of each variable are identified. A variable is said to be defined in a program statement (def-node) if a value is associated with the variable. A variable is said to be used in a program statement if its value is referenced for a computational use (c-use node) or a predicate use (p-use node). Data-flow testing should cause the traversal of def-clear sub-paths from the variable definition to either some or all of the p-uses, c-uses, or their combination. Empirically, the all-uses criterion has been shown to be the most effective of the data-flow criteria [40]. A def-clear path does not include any intermediate nodes containing other definitions of that variable (killing nodes). A def-clear path can be further categorized as a dcu-path (c-use of the variable) or a dpu-path (p-use of the variable). For the example program, Table 1 provides the definition and use nodes for each variable, Table 2 provides the list of all def-use paths, and Table 3 provides the dominance paths for the nodes of the program flow graph (a sketch of how such dominator sets can be computed is given after Table 3).

Figure 1: Triangle classification program (source line, CFG node, statement).

#include <stdio.h>
 1   1   void main() {
 2   1     int a, b, c, valid;
 3   1     printf("\nEnter the value of three sides: ");
 4   1     scanf("%d%d%d", &a, &b, &c);
 5   1     valid = 0;
 6   2     if ((a>=0)&&(a<=100)&&(b>=0)&&(b<=100)&&(c>=0)&&(c<=100)) {
 7   3       if (((a+b)>c)&&((c+a)>b)&&((b+c)>a)) {
 8   4         valid = 1;
 9   5       }
10   5     }
11   6     if (valid==1) {
12   7       if ((a==b)&&(b==c))
13   8         printf("\nEquilateral triangle.");
14   9       else if ((a==b)||(b==c)||(c==a))
15  10         printf("\nIsosceles triangle.");
16  11       else
17  11         printf("\nScalene triangle.");
18  12     } else {
19  13       printf("\nInvalid input.");
20  14     }
21  15   }

Figure 2: CFG of the example program.

Table 1: List of variables and def-use occurrences in the example program.
Variable | def node | c-use node | p-use edges
a, b, c  | 1        | None       | 2-3, 2-6, 3-4, 3-5, 7-8, 7-9, 9-10, 9-11
valid    | 1, 4     | None       | 6-7, 6-13

Table 2: List of def-use paths for the example program.
Path No. | def-use path (terminates with -1 for c-use) | Killing node(s)
1  | 1-2-3   | None
2  | 1-2-6   | None
3  | 1-3-4   | None
4  | 1-3-5   | None
5  | 1-7-8   | None
6  | 1-7-9   | None
7  | 1-9-10  | None
8  | 1-9-11  | None
9  | 1-6-7   | 4
10 | 1-6-13  | 4
11 | 4-6-7   | None
12 | 4-6-13  | None

Table 3: Dominance paths for the nodes of the program flow graph.
Node No. | Dominance path
1  | 1
2  | 1-2
3  | 1-2-3
4  | 1-2-3-4
5  | 1-2-3-5
6  | 1-2-6
7  | 1-2-6-7
8  | 1-2-6-7-8
9  | 1-2-6-7-9
10 | 1-2-6-7-9-10
11 | 1-2-6-7-9-11
12 | 1-2-6-7-12
13 | 1-2-6-13
14 | 1-2-6-14
15 | 1-2-6-14-15
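For illustration, the following self-contained C++ sketch (our addition, with an assumed predecessor-list CFG representation) computes Dom(m) for every node with the classical iterative data-flow algorithm: Dom(entry) = {entry} and Dom(m) = {m} united with the intersection of Dom(p) over the predecessors p of m. Applied to the CFG of Figure 2 it reproduces sets such as Dom(12) = {1, 2, 6, 7, 12}, from which the dominance paths of Table 3 follow.

#include <vector>
#include <set>

// Iteratively compute dominator sets for a CFG given as predecessor lists.
// preds[v] lists the predecessors of node v; `entry` is the entry node.
std::vector<std::set<int>> dominators(const std::vector<std::vector<int>> &preds, int entry) {
    int n = static_cast<int>(preds.size());
    std::set<int> all;
    for (int v = 0; v < n; ++v) all.insert(v);

    std::vector<std::set<int>> dom(n, all);   // start from "dominated by everything"
    dom[entry] = {entry};

    bool changed = true;
    while (changed) {                         // iterate to a fixed point
        changed = false;
        for (int v = 0; v < n; ++v) {
            if (v == entry) continue;
            std::set<int> d = all;            // intersect the predecessors' dominator sets
            for (int p : preds[v]) {
                std::set<int> tmp;
                for (int x : d) if (dom[p].count(x)) tmp.insert(x);
                d = tmp;
            }
            d.insert(v);                      // every node dominates itself
            if (d != dom[v]) { dom[v] = d; changed = true; }
        }
    }
    return dom;
}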
4 Particle swarm optimization

In 1995, Kennedy and Eberhart [41] introduced the Particle Swarm Optimization algorithm, a population-based search algorithm based on the social and cognitive behavior of swarms such as flocks of birds, herds of animals or schools of fish. The application of PSO to many continuous-space problems in the field of Computer Science and Engineering has demonstrated its potential. Unlike GA, PSO does not use evolution operators such as crossover and mutation. Instead, each member of the swarm (called a particle) attains the optimal solution by learning from its own experience and the experience of other members of the swarm. Each particle maintains its current position, current velocity and the best position it has achieved so far, called pbest. The global best position of the swarm is called gbest. Both pbest and gbest are used by the particle in determining its next best position in the swarm. Thus, the knowledge of previous good solutions is retained by all the particles, resulting in faster convergence towards the optimal solution.

Consider a swarm of n particles denoted as (p1, p2, ..., pn). The position of the ith particle in the d-dimensional search space is denoted as Xi = (Xi1, Xi2, ..., Xid) and the associated velocity is denoted as Vi = (Vi1, Vi2, ..., Vid). The personal best position of the ith particle in dimension d is denoted as pbestid. The position of the best particle of the entire swarm in dimension d is denoted as gbestd. The velocity and position of the ith particle in dimension d are updated by Equations 1 and 2 given below:

Vid = w×Vid + c1×r1×(pbestid − Xid) + c2×r2×(gbestd − Xid)   (1)
Xid = Xid + Vid   (2)

where c1 and c2 are positive learning constants called the cognitive and social scaling parameters, chosen in such a way that their sum never exceeds 4, and r1 and r2 are two random numbers in the range [0, 1]. The inertia weight w controls the impact of the previous history on the new velocity of the ith particle. A particle's velocity in each dimension is clamped to a maximum magnitude Vmax. The position and velocity of each particle in the swarm are continuously updated until an optimal solution is achieved.

4.1 Adaptive inertia weight

In the PSO algorithm, a large value of the inertia weight facilitates exploration (global search) of the input search space and a small value facilitates exploitation (local search) of the input search space for the optimal solution. The inertia weighting strategies used in the literature have been categorized into constant, random, time-varying and adaptive inertia weight strategies [42]. In constant and random inertia weight strategies, the value of the inertia weight is either constant or chosen randomly during the search. In time-varying inertia weight strategies, the inertia weight is defined as a function of time or iteration number. Here, the value of the inertia weight is independent of the state of the particles in the search space. In adaptive inertia weight strategies, the state of the particles in the search space (a feedback mechanism) is used to adjust the value of the inertia weight. In this study, the fitness values of the particles are used to adjust the inertia weight. The ratio ρ of the particle's fitness to the maximum fitness of the swarm is calculated as shown in Equation 3 below:

ρ = fi / fmax   (3)

Here, fi is the fitness of the ith particle and fmax is the maximum fitness achieved by the particles in the swarm. The range of ρ is [0, 1].
For lower values of ρ, increasing the inertia weight can strengthen the particle's search capability. For values of ρ that are closer to 1, a smaller inertia weight should be used. The inertia weight wi for the ith particle is therefore defined as a linear function of ρ and is calculated as follows:

wi = 0.5×(1 − ρ) + 0.5   (4)

The range of the inertia weight is [0.5, 1]. PSO is computationally inexpensive. The ability of PSO to balance local exploitation and global exploration of the search space enhances its searching ability and avoids premature convergence towards the optimal solution.

5 Differential evolution

The Differential Evolution (DE) algorithm was given by Storn and Price [43] in 1995. It is a stochastic population-based global optimization algorithm that uses an evolutionary differential operator to create new offspring from parent chromosomes. Unlike GA, DE works on real-valued chromosomes. The differential operator of DE replaces the classical crossover and mutation operators of GA. Let us say the initial population consists of n vectors denoted as (p1, p2, ..., pn). The position of the ith vector in the d-dimensional space is denoted as Xi = (Xi1, Xi2, ..., Xid). These vectors are referred to as chromosomes in DE. To change each chromosome (target vector), a difference vector Vi is created. In the literature, there are various mutation schemes to create this vector. In this paper, the DE/Rand/1 scheme is used. In this scheme, for each ith member Xi of the current population, three other members (say r1, r2 and r3) are randomly chosen from the current population. Next, the scaled difference (mutation scaling factor F) of two of the three vectors is added to the third one to obtain the difference vector Vi. The jth component of the difference vector is given below:

vi,j = xr1,j + F×(xr2,j − xr3,j)   (5)

To increase the population diversity, a 'crossover scheme' is applied. The difference vector exchanges its components with the target vector Xi to obtain the offspring/trial vector Ui. The most common crossover in DE is 'uniform crossover', given below:

ui,j = vi,j  if rand(0, 1) < CR
     = xi,j  otherwise   (6)

CR is called the crossover constant. The final step in the DE algorithm is the fitness-based selection of either the target vector or the trial vector into the next generation. F and CR are the control parameters of DE. The performance of DE depends on the manipulation of the target vector and the difference vector in order to obtain a trial vector.

6 Proposed hybrid algorithm

In the proposed study, an adaptive PSO algorithm is hybridized with the DE algorithm, incorporating a local neighborhood search strategy. The synergy between the PSO and DE algorithms results in a more powerful global search algorithm. The local neighborhood search strategy helps in exploring more promising candidate solutions to overcome the problem of local optima. In the proposed hybrid (adaptive PSO and DE) algorithm, a differential velocity term inspired by the DE mutation scheme is computed by taking the difference of the position vectors of two distinct particles randomly chosen from the swarm. A random number r is generated between 0 and 1. If r is less than the DE crossover probability, Equation 7 (given below) is used to update the velocity of a particle. In Equation 7, the cognitive term (the second term) in Equation 1 is replaced by the differential term scaled by the DE mutation scaling factor.
Vid = w×Vid + F×(xjd − xkd) + c2×r2×(gbestd − Xid)   (7)

Here, xj and xk denote the positions of particles j and k respectively (i ≠ j ≠ k) that are randomly chosen from the swarm. A survival-of-the-fittest mechanism is also followed by incorporating the greedy selection scheme of DE as given by Equation 6. Therefore, a particle either moves to a better location or remains at its previous position in the input search space. The current position of a particle will always be its best position.

The steps of the proposed hybrid (adaptive PSO and DE) algorithm are given in Figure 5, and the flowchart is given in Figure 6. Inputs to the algorithm are an instrumented program, the dominator tree of the program, the list of def-use paths to be traversed and the killing nodes if any, the number of input variables, the domain range of each input variable, and the algorithmic parameters: population size, PSO acceleration parameters, PSO maximum velocity, DE mutation scaling factor and DE crossover probability. The adaptive inertia weight is used as given by Equations 3 and 4. For the data-flow coverage criterion, the design of the fitness function is explained in Section 6.2 below. The initial value of pbest and gbest is 0. The algorithm is run once for each uncovered def-use path. If the selected path is not covered by any member of the current population, the fitness value is computed for each member. Accordingly, for each particle, the personal best position pbest and the global best position gbest are updated. During the evolution process, a particle's position and velocity are adjusted according to Equations 2 and 7 respectively. If the updated position of the particle is outside the input domain range, a local neighbourhood strategy is applied. Then, the greedy selection scheme of DE is used to generate the new population. The evolution process continues until the termination criterion is met. The other uncovered paths are also checked for coverage. The output is an optimal test suite and a list of def-use paths marked as covered or uncovered, if any. A tool is developed for instrumenting programs and generating def-use paths. The dominator tree is generated manually. Infeasible paths, if any, are determined by careful analysis of the program.

6.1 Neighbourhood search strategy

Every meta-heuristic search algorithm suffers from the problem of local optima. Another issue related to meta-heuristic search algorithms is boundary constraints. There are no set mechanisms to deal with such problems. Hence, in this study, an effort is also made to handle the problems of local optima and boundary constraints and to improve the exploitation ability of the algorithm. A neighbourhood search strategy (Figure 4) is introduced to sample more promising candidate solutions to overcome these problems. It is summarized as follows (a code sketch of the hybrid velocity update and this fallback is given after the steps):

Step 1: For each particle, the Euclidean distance to the other particles in the input search space is calculated using the positions of the particles. The other particles within a threshold Euclidean distance (determined by a preliminary study to fine-tune the algorithmic parameters) form the neighbourhood. The Euclidean distance between two particles Xi and Xj in the n-dimensional search space is given by the following equation:

dij = sqrt( Σ from k=1 to n of (xik − xjk)² )   (8)

Step 2: If a particle's new position is out of range, the other particles in the neighbourhood are evaluated.

Step 3: The position of the particle is then replaced with that of the best particle in the neighbourhood instead of a random value. This helps in exploring more promising candidate solutions.
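The following C++ sketch is our own hedged reading of the hybrid update, not the authors' implementation: Equation 7 (the DE differential term replacing the cognitive term) is used with probability CR, Equation 1 otherwise, followed by DE-style greedy selection. The per-particle adaptive inertia weight and the neighbourhood fallback of Section 6.1 are simplified here (the fallback is only indicated by a comment), and all parameter names and the fitness callback are illustrative assumptions.

#include <algorithm>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

struct Particle {
    std::vector<double> x, v, pbest;
    double fitness = 0.0;
};

// One generation of the hybrid velocity/position update with greedy selection.
void hybrid_update(std::vector<Particle> &swarm, const std::vector<double> &gbest,
                   double w, double c1, double c2, double F, double CR,
                   double vmax, double lo, double hi,
                   const std::function<double(const std::vector<double>&)> &fit) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_real_distribution<double> U(0.0, 1.0);
    std::uniform_int_distribution<size_t> pick(0, swarm.size() - 1);

    for (size_t i = 0; i < swarm.size(); ++i) {
        Particle cand = swarm[i];
        size_t j = pick(rng), k = pick(rng);            // two distinct random particles
        while (j == i) j = pick(rng);
        while (k == i || k == j) k = pick(rng);
        for (size_t d = 0; d < cand.x.size(); ++d) {
            double vel;
            if (U(rng) < CR)                            // Equation 7: DE differential term
                vel = w * cand.v[d] + F * (swarm[j].x[d] - swarm[k].x[d])
                      + c2 * U(rng) * (gbest[d] - cand.x[d]);
            else                                        // Equation 1: standard PSO update
                vel = w * cand.v[d] + c1 * U(rng) * (cand.pbest[d] - cand.x[d])
                      + c2 * U(rng) * (gbest[d] - cand.x[d]);
            vel = std::max(-vmax, std::min(vmax, vel)); // clamp to Vmax
            cand.v[d] = vel;
            cand.x[d] += vel;                           // Equation 2
        }
        bool out_of_range = false;
        for (double xd : cand.x) if (xd < lo || xd > hi) out_of_range = true;
        if (out_of_range) continue;                     // a full implementation would apply the
                                                        // neighbourhood fallback of Section 6.1 here
        cand.fitness = fit(cand.x);
        if (cand.fitness > swarm[i].fitness) {          // greedy selection (maximizing fitness)
            swarm[i] = cand;
            swarm[i].pbest = swarm[i].x;                // current position is always the best position
        }
    }
}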
6.2 Design of the fitness function

Def-use associations can be represented as node-node fitness functions [28]. Def-use associations specify the node of definition and the node of use for the program variables in the CFG without specifying a concrete path between the nodes. This implies that the first objective to reach is the definition node and then the use node, without, however, specifying a path through the CFG. The distance to a node is represented by the standard minimizing metric given below:

node distance = approach level + v(branch distance)   (9)

(Figure 4: Local Neighbourhood Strategy.)

It evaluates to 0 if the target is covered. The approach level is the closest point (a node) of a given execution to the target node. A branch is said to be critical if it leads the program execution away from the target node in a path through the program structure [44]; the branch distance is calculated at that particular predicate node using the values of the variables according to the formulae given in Table 4 [3] below.

Table 4: Branch distance measure for relational and logical predicates.
S. No. | Predicate (C) | Branch distance formula f(C)
1 | Boolean   | if true then 0 else K
2 | x = y     | if (x − y) = 0 then 0 else abs(x − y) + K
3 | x ≠ y     | if abs(x − y) ≠ 0 then 0 else K
4 | x > y     | if (y − x) < 0 then 0 else (y − x) + K
5 | x ≥ y     | if (y − x) ≤ 0 then 0 else (y − x) + K
6 | x < y     | if (x − y) < 0 then 0 else (x − y) + K
7 | x ≤ y     | if (x − y) ≤ 0 then 0 else (x − y) + K
8 | C1 && C2  | f(C1) + f(C2)
9 | C1 || C2  | min(f(C1), f(C2))
K is a failure constant that is added to the branch distance if the predicate is false.

Branch distance provides a measure of how close the program execution was to traversing the alternate edge of the critical branch. The branch distance is normalized into the range [0, 1] using a normalization function v, such that the approach level always dominates the branch distance. In our previous study [22], a novel maximizing fitness function was proposed for the data-flow coverage adequacy criterion based on the standard metric (Equation 9) and the dominator tree. Dominance relations between the nodes of the CFG are used to obtain a path-cover for the nodes of the selected def-use path. The fitness function considers each def-use path as two objectives. For a dcu-path, the first objective is to cover the dominance path of the definition node and then to cover the dominance path of the use node. For a dpu-path, the first objective is to cover the dominance path of the definition node and then to cover the dominance paths of the nodes of the p-use edge (u1, u2). A dpu-path is formed for both branches (T/F) of the predicate node. A test case is evaluated with respect to the selected def-use path by executing the program under test with it as an input and recording the nodes that are covered. If a killing node is traversed between the source node and the use node, a fitness value of 0 is assigned to the test case and it is discarded. The fitness value is 1 if all the nodes of the dominance paths of both objectives are covered; otherwise, the closeness of the test case to the missed objective (branch distance) is computed. In this work, for fitness maximization, the branch distance bch(x, ti) at the critical branch for test case ti and target node x is the reciprocal of the value returned by the appropriate formula from Table 4, i.e., the closer a test case is to covering the required branch, the higher is its fitness value.
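As an illustration of Table 4 and the reciprocal used for fitness maximization, here is a small C++ sketch (our own, with the failure constant K assumed to be 1; the paper's worked examples appear to use a different K) that computes the branch distance for a relational predicate x OP y and the corresponding maximizing value bch.

#include <cmath>
#include <string>

// Branch distance f(C) for a relational predicate "x OP y" following Table 4,
// with failure constant K added when the predicate is false (K = 1 assumed here).
double branch_distance(double x, double y, const std::string &op, double K = 1.0) {
    if (op == "==") return (x - y == 0) ? 0.0 : std::fabs(x - y) + K;
    if (op == "!=") return (std::fabs(x - y) != 0) ? 0.0 : K;
    if (op == ">")  return (y - x < 0)  ? 0.0 : (y - x) + K;
    if (op == ">=") return (y - x <= 0) ? 0.0 : (y - x) + K;
    if (op == "<")  return (x - y < 0)  ? 0.0 : (x - y) + K;
    if (op == "<=") return (x - y <= 0) ? 0.0 : (x - y) + K;
    return K;  // unknown operator: treat as a failed Boolean predicate
}

// Maximizing branch distance bch: 1 when the target branch is taken,
// otherwise the reciprocal of f(C), so closer test cases get higher values.
double bch(double f_of_C) {
    return (f_of_C == 0.0) ? 1.0 : 1.0 / f_of_C;
}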
The fitness function uses control-flow information (dominance relations between the nodes of the CFG) augmented with the branch distance if only a partial aim is achieved. This provides a smoother landscape that guides the search process towards the optimal solution. The branch distance is computed using Equation 10 and the fitness functions are given by Equations 11 and 12, as explained below. The branch distance bch(x, ti) for test case ti (i = 1...p) and target node x, for fitness maximization, is calculated as follows:

bch(x, ti) = 1, if the test case ti leads to the target node x;
bch(x, ti) = 1 / f(C), otherwise, where f(C) is the appropriate formula from Table 4 for the predicate C at the critical branch.   (10)

The fitness function to evaluate the fitness of a test case ti (i = 1...p) w.r.t. a dcu-path (d, u, v), where d is the definition node and u is the c-use node of a variable v, is given below:

ft(d, u, ti) = (1/2) × ( (|cdom(d, ti)| / |dom(d)|) × bch(d, ti) + (|cdom(u, ti)| / |dom(u)|) × bch(u, ti) )   (11)

Similarly, the fitness function to evaluate the fitness of a test case ti (i = 1...p) w.r.t. a dpu-path (d, (u1, u2), v), where d is the definition node and (u1, u2) is the p-use edge of a variable v, is given below:

ft(d, (u1, u2), ti) = (1/3) × ( (|cdom(d, ti)| / |dom(d)|) × bch(d, ti) + (|cdom(u1, ti)| / |dom(u1)|) × bch(u1, ti) + (|cdom(u2, ti)| / |dom(u2)|) × bch(u2, ti) )   (12)

In general,
• dom(x): the set of nodes in the dominance path of the target node x;
• cdom(x, ti): the set of nodes in dom(x) that are covered by test case ti (i = 1...p);
• bch(x, ti): the branch distance for test case ti (i = 1...p) and target node x, using Equation 10.

If a killing node is traversed, a fitness value of 0 is assigned to the test case ti and it is discarded; otherwise Equation 11 or Equation 12 is used to compute the fitness value. Test case ti is said to be optimal if its fitness value is 1, i.e., the target is covered.

Consider def-use path #5, (1, 7, 8), for coverage from Table 2. This is a dpu-path that tests for the 'Equilateral triangle' condition. Node 1 (source) and the p-use edge (7, 8) (target) form the two objectives; their dominance paths are to be covered by an input test case. There are three cases: if the dominance paths of both nodes are covered, the fitness value of the input test case is 1 and it is optimal. However, if a partial aim is covered (one of the two nodes) or none of the nodes is covered, the fitness value of the input test case is computed using Equations 10 and 12. From Table 3, the dominance paths of the nodes are as given below:

dom(d) = dom(1) = {1}
dom(u1) = dom(7) = {1, 2, 6, 7}
dom(u2) = dom(8) = {1, 2, 6, 7, 8}

Case 1: Input test case t1 = <2, 2, 2>. Path traversed: {1, 2, 3, 4, 5, 6, 7, 8, 12, 15}. The dominance path of the definition node (node 1) is covered. The dominance path of the first node of the p-use edge (node 7) is covered. The dominance path of the second node of the p-use edge (node 8) is covered. As the dominance paths of both objectives are covered, the fitness value of the input test case using Equation 12 is 1; the input test case t1 is therefore optimal.

Case 2: Input test case t2 = <2, 2, 1>. Path traversed: {1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 15}. The dominance path of the definition node (node 1) is covered. The dominance path of the first node of the p-use edge (node 7) is covered. The dominance path of the second node of the p-use edge (node 8) is not covered; the critical node is node 7.
Consider the def-use path #5, (1, 7, 8), from Table 2, selected for coverage. This is a dpu-path that tests the 'equilateral triangle' condition. Node 1 (source) and the p-use edge (7, 8) (target) form the two objectives; their dominance paths are to be covered by an input test case. There are three cases: if the dominance paths of both objectives are covered, the fitness value of the input test case is 1 and it is optimal; if only a partial aim is covered (one of the two objectives) or none is covered, the fitness value of the input test case is computed using Equations 10 and 12. From Table 3, the dominance paths of the nodes are as given below:
dom(d) = dom(1) = {1}
dom(u1) = dom(7) = {1, 2, 6, 7}
dom(u2) = dom(8) = {1, 2, 6, 7, 8}

Case 1: input test case t1 = <2, 2, 2>. Path traversed: {1, 2, 3, 4, 5, 6, 7, 8, 12, 15}. The dominance path of the definition node (node 1) is covered. The dominance path of the first node of the p-use edge (node 7) is covered. The dominance path of the second node of the p-use edge (node 8) is covered. As the dominance paths of both objectives are covered, the fitness value of the input test case using Equation 12 is 1; the input test case t1 is therefore optimal.

Case 2: input test case t2 = <2, 2, 1>. Path traversed: {1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 15}. The dominance path of the definition node (node 1) is covered. The dominance path of the first node of the p-use edge (node 7) is covered. The dominance path of the second node of the p-use edge (node 8) is not covered; the critical node is node 7. The branch distance at node 7 using Equation 10 is bch(8, t2) = 0.91. The fitness value of the input test case using Equation 12 is ft(1, (7, 8), t2) = 0.91.

Case 3: input test case t3 = <1, 2, 4>. Path traversed: {1, 2, 3, 5, 6, 12, 13, 14, 15}. The dominance path of the definition node (node 1) is covered. The dominance path of the first node of the p-use edge (node 7) is not covered; the critical node is node 6. The branch distance at node 6 using Equation 10 is bch(7, t3) = 0.91. The dominance path of the second node of the p-use edge (node 8) is not covered; the critical node is node 7. The branch distance using Equation 10 is bch(8, t3) = 0.91. The fitness value of the input test case using Equation 12 is ft(1, (7, 8), t3) = 0.74.

This case study shows that the input test case t1 covers the selected def-use path #5. The input test case t2 covers the def node and the first node of the p-use edge of def-use path #5 (a partial aim). The input test case t3 covers neither node of the p-use edge of def-use path #5. Accordingly, ft(1, (7, 8), t1) > ft(1, (7, 8), t2) > ft(1, (7, 8), t3); the input test cases are thus also ranked according to their fitness values.

7 Experimental setup
In this section, the research questions, the algorithmic parameter settings, the details of the subject programs, and the experimental results are provided. DE, PSO, GA and random search techniques are also implemented for comparison with the proposed hybrid (adaptive PSO and DE) algorithm.

7.1 Research questions
The following research questions are formulated to evaluate the performance of the proposed hybrid algorithm:
RQ1: How effective is the proposed hybrid (adaptive PSO and DE) algorithm for optimal test data generation in achieving 100% data-flow coverage of a program?

Algorithm ATDG_Hybrid_PSO_DE
Input:
P : instrumented version of the program under test
arg = (a1, a2, …, ad): argument list of P encoded into a d-dimensional position vector
DT : dominator tree for the program P
Paths : list of test requirements, i.e., def-use paths
Popinit : initial random population of n particles Xi = [Xi1, Xi2, …, Xid] and their velocities Vi = [Vi1, Vi2, …, Vid] for i = 1, 2, …, n
c1, c2, Vmax : algorithmic parameters of the Particle Swarm Optimization (PSO) algorithm
F, CR : algorithmic parameters of the Differential Evolution (DE) algorithm
Output:
TestSuite : set of optimal test cases
Pathstat : list of test requirements marked as 'covered' and 'could not be covered' (if any)
Begin
1. Popold = Popinit
2. Popcur = Popinit
3. while some pathi in Paths is not marked {
4.   while (termination criterion is not met) {  // either pathi is covered or MaxAttempts is reached
5.     for each particle i of Popcur {
6.       Decode position vector Xi into a test case ti
7.       if pathi is not marked {
8.         Check pathi for coverage w.r.t. ti and calculate the fitness value using Eq. 11 or Eq. 12
9.         if pathi is covered {
10.          Mark pathi as 'covered' (update Pathstat)
11.          Add ti to TestSuite
12.        }
13.      }
14.      for each pathj of TestReq other than pathi that is not marked {
15.        Check pathj for coverage with respect to ti
16.        if pathj is covered
17.          Mark pathj as 'covered' (update Pathstat)
18.      }
19.    }
20.    if pathi is covered
21.      Go to line 3
22.    else {
23.      Update gbestij
24.      for each particle i of Popcur {  // generate a new population Popnew
25.        Calculate inertia weight w using Equations 3 and 4
26.        Randomly choose two distinct particles k and l from Popcur (i ≠ k ≠ l)
27.        for each dimension j (1 ≤ j ≤ d) of particle i {
28.          Update pbestij
29.          Randomly generate r between 0 and 1
30.
… if V > Vmax open EV2; if EV2 blocked close, open EV3; end end end end

This system must avoid the overflow of the controlled tank. According to the information received from the sensor, if the volume in the controlled tank exceeds Vmax (V > Vmax), the computer actuates the electrovalves EV2 or EV3 of the system to drain the controlled tank; if the sensor detects that the volume in the controlled tank exceeds the upper limit Vmax and EV2 (blocked closed) is out of service (EV2_HS), then EV3 can be used to drain the controlled tank into the draining tank. If EV2_HS and EV3_HS, we consider that the controlled tank overflows. In this work we consider that only the electrovalves EV1, EV2, EV3 and the computer (CP) can fail (EV1_HS, EV2_HS, EV3_HS and CP_F, computer failed) while filling the controlled tank.

Figure 5: Case study.

4.2 Application of the proposed approach
By applying the method described in Section 4, the first step is the qualitative analysis optimization for deriving the MFS, in order to identify the causal events leading to the overflow of the controlled tank.

4.2.1 Qualitative analysis optimization
The qualitative optimization is based on the simplified output expression (the minimal cut sets) obtained from the Boolean reduction of the TT method combined with the KT, as previously described in Section 4. Our goal is to find the combinations of component failures causing system failure (overflow of the controlled tank). To construct the TT of the case study, we start with the state in which all components (All_OK) are in good condition, (EV1 EV2 EV3 CP) = (1111). We then list all combinations of operational and failed states of the three electrovalves and the computer (2^N = 2^4 = 16 combinations), which gives Table 4.

Table 4: Truth table of the case study.
EV1 EV2 EV3 CP | State Functioning (SF)
1 1 1 1 | 1
0 1 1 1 | 1
1 0 1 1 | 1
0 0 1 1 | 1
1 1 0 1 | 1
0 1 0 1 | 1
1 0 0 1 | 1
0 0 0 1 | 0
1 1 1 0 | 0
0 1 1 0 | 0
1 0 1 0 | 0
0 0 1 0 | 0
1 1 0 0 | 0
0 1 0 0 | 0
1 0 0 0 | 0
0 0 0 0 | 0

In the truth table, the state (EV1 EV2 EV3 CP) = (0000) represents the overflow of the system (SF = 0). In this work, the aim of the qualitative optimization is to determine the minimal cut sets (the Minimal Feared State), by using the KT to generate the minimal number of feared states from the TT. This is an efficient way to compute the Minimal Feared State of the studied system based on the causality events of the TT. From the TT (Table 4) we therefore construct the KT, shown in Table 5.

Table 5: Karnaugh table of the case study.
EV1 EV2 \ EV3 CP | 00 | 01 | 11 | 10
00 | 0 | 0 | 0 | 0
01 | 0 | 1 | 1 | 1
11 | 1 | 1 | 1 | 1
10 | 0 | 0 | 0 | 0

The cell (EV1 EV2 EV3 CP) = (1111) represents the initial state All_OK (EV1_OK, EV2_OK, EV3_OK and CP_OK), not simplified with adjacent cases; the cell (0000) represents the failure state of EV1, EV2, EV3 and CP (EV1_HS, EV2_HS, EV3_HS and CP_F).
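For illustration, the small sketch below carries out the qualitative step programmatically: it enumerates the 2^4 component states, evaluates a system-function predicate, and keeps only the minimal combinations of failures. The predicate shown (the system functions only while the computer works and at least one electrovalve is still available) is an assumption read off the truth table as reconstructed above, and all names are illustrative.

```python
# Sketch of the qualitative step: enumerate the 2^4 component states, evaluate the
# system function and keep only the minimal failure combinations (minimal cut sets).
from itertools import product

COMPONENTS = ("EV1", "EV2", "EV3", "CP")

def system_works(ev1, ev2, ev3, cp):
    # Assumption read off the reconstructed Table 4: the system functions only while
    # the computer works and at least one electrovalve is still available.
    return cp == 1 and (ev1 == 1 or ev2 == 1 or ev3 == 1)

failure_sets = []
for state in product((1, 0), repeat=len(COMPONENTS)):
    if not system_works(*state):
        failure_sets.append({c for c, s in zip(COMPONENTS, state) if s == 0})

# A cut set is minimal if no proper subset of it also causes system failure.
minimal_cut_sets = [s for s in failure_sets
                    if not any(t < s for t in failure_sets)]
print(sorted(minimal_cut_sets, key=len))   # e.g. [{'CP'}, {'EV1', 'EV2', 'EV3'}]
```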
From the Karnaugh Table (see Table 5) we deduce the minimized Boolean expression (equation 11):

(11)

The minimal feared scenario (equation 11) deduced from the KT is used not only in the qualitative optimization but in all the quantitative evaluations as well. The description of a scenario given previously (in Section 3.5) can be represented by a Markov graph; this allows drawing the Reduced Markov Graph for the quantitative optimization studied in the next section, in which the circles are the events and the lines are the transitions.

4.2.2 Quantitative analysis optimization
To study the dependability of the system controlled by computer (the case study), it is important, first, to model it. The first part of the proposed methodology is therefore the qualitative analysis optimization, which provides all the necessary information about the operation and the dysfunction of the studied system and about the causal events leading to the feared state. Quantitative evaluations are most easily performed once the minimal feared state is obtained. The aim of this section is to complement the qualitative study with a quantitative analysis based on the construction of a Markov graph, which limits the combinatorial explosion [13], [14], [16], [17]. This graph is directly constructed from the minimal feared states (the Reduced Markov Graph) obtained from the qualitative optimization. It is composed of a set of functional modes and a set of transitions, to which statistical information about the system dynamics is added. This method permits the calculation of the reliability or availability of a system, repairable or not, with constant failure rates. It gives a representation of the causes of failures and of their combinations that lead to the feared situation (overflow of the controlled tank). We use the Software Reliability Workbench [26] to model the case study and to carry out the quantitative optimization; Reliability Workbench is Isograph's flagship suite of reliability, safety and maintainability software.

We take the three electrovalves and the computer to have a repair rate µ = 0.2 h-1 and failure rates of λ = 0.02 h-1 for the three electrovalves EV1, EV2 and EV3, and λ = 0.05 h-1 for the computer (CP). Consequently, from the results of the qualitative analysis (equation 11), using the causality events of the TT and KT, we directly build the Reduced Markov Graph (RMG), represented in the Reliability Workbench software for quantitative analysis, as shown in Figure 6.

Figure 6: Reduced Markov Graph of the case study.

The nodes correspond to the states of the system; the lines describe the transitions between these states, and a constant transition rate is associated with each transition. The Reduced Markov Graph of Figure 6 shows the event combinations leading to the feared state (Overflow); it includes the minimal failure sequences leading to the feared events. The state All_OK: all electrovalves and the computer are in good condition. The state EV1_HS represents the failure of EV1 (EV1_HS), with EV2 and CP in good condition. The state EV2_HS represents the failure of EV2 (EV2_HS), with EV1 and CP in good condition. The state EV3_HS represents the failure of EV3 (EV3_HS), with EV1 and CP in good condition. The state "EV1, EV2 HS" represents the failure of EV1 and EV2, with EV3 and CP in good condition. The state "EV1 and EV3 HS" represents the failure of EV1 and EV3, with EV2 and CP in good condition. The state CP_F represents the failure of the computer. The state Overflow corresponds to the failures of EV1, EV2, EV3 and CP ((EV1, EV2, EV3, CP) = (0000)); this sequence represents the overflow of the controlled tank (system state = 0). Having defined the Reduced Markov Graph, we can now proceed to perform an analysis. With a direct simulation in the Reliability Workbench software, using 100 points and a lifetime of 450 h, we obtain the following results. Figure 7 shows the reliability of the controlled tank.

Figure 7: Reliability of the controlled tank.
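To make the quantitative step concrete, here is a minimal sketch of how a reduced Markov graph can be evaluated numerically as a continuous-time Markov chain. The state space and transition structure below are simplified assumptions made only for the example (overflow once both draining valves have failed), not the exact graph of Figure 6; numpy and scipy are assumed to be available.

```python
# Minimal sketch of the quantitative step on a reduced Markov graph.  The state space
# and transitions are a simplified assumption (overflow once both draining valves have
# failed), not the exact graph of Figure 6; numpy and scipy are assumed available.
import numpy as np
from scipy.linalg import expm

states = ["All_OK", "EV2_HS", "EV3_HS", "Overflow"]
lam, mu = 0.02, 0.2                      # failure and repair rates from the case study (1/h)

Q = np.zeros((4, 4))                     # generator matrix of the CTMC
Q[0, 1] = lam; Q[0, 2] = lam             # one draining valve fails
Q[1, 0] = mu;  Q[2, 0] = mu              # a failed valve is repaired
Q[1, 3] = lam; Q[2, 3] = lam             # the second valve fails -> Overflow (absorbing)
np.fill_diagonal(Q, -Q.sum(axis=1))      # rows of a generator sum to zero

p0 = np.array([1.0, 0.0, 0.0, 0.0])      # the system starts in All_OK
for t in (50, 100, 200):
    p_t = p0 @ expm(Q * t)               # transient state probabilities at time t
    print(t, round(1.0 - p_t[states.index("Overflow")], 3))  # probability of no overflow
```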
The simulation shows that at time 200 h the reliability of the controlled tank is 0.11; at time 100 h the reliability equals 0.33, and at time 50 h it equals 0.57. We can see that the reliability of the system depends on the failure states of its components; it decreases rapidly as the number of failed components increases. Figure 8 shows the Failure Frequency (FF) of the overflow of the controlled tank. From Figure 9, we can see that at the time instants t = 200 h, 100 h and 50 h the CFI of the system equals 0.011. As Figure 10 shows, the probability of overflow of the controlled tank is 0.89 at time 200 h, 0.67 at time 100 h and 0.43 at time 50 h. As confirmed by the simulation results, we conclude that, because of the failure states of the three electrovalves (EV1, EV2 and EV3) and of the computer (CP), the probability of overflow of the controlled tank increases rapidly with time.

5 Conclusion
In this paper we have proposed a new approach for optimizing the qualitative and quantitative analysis used for the dependability evaluation of modern intelligent systems, such as systems controlled by computer. The first step of the proposed approach is the qualitative analysis optimization, which derives the minimal feared scenario based on the causality events of the Truth Table combined with the Karnaugh Table. It is a good tool for understanding how the system functions, and it lets us pick out the minimal feared scenario. The Karnaugh Table process is an orderly process requiring fewer steps and always producing a minimum expression (the minimal feared state) for the dependability of the system. The combination of the TT with the KT has two advantages. On the one hand, it reduces the feared states (to the minimal feared state); on the other hand, with the simplified output expression (reduced to fewer terms), it reduces the combinatorial explosion of the number of states of the Markov Graph (used to construct the RMG) for the quantitative optimization. This allows complex systems to be modelled and the dependencies between failures to be found. The Reduced Markov Graph permits the representation of state-dependent behaviour, including information on the nature of the components (electronic, sensor, software, ...) and on system repair. The quantitative evaluations are most easily performed once the minimal feared scenario is obtained. The advantage of the Reduced Markov Graph lies in its ability to take into account the dependencies between components and in the possibility of obtaining various measures from the same model (Reliability, Availability, Maintainability, Security, ...). The simulation with the Isograph Reliability Workbench verifies the effectiveness of our approach.

References
[1] Elena Dubrova. Fundamentals of Dependability, Chapter 2, in Fault-Tolerant Design. Springer, 2013. ISBN 978-1-4614-2112-2, XV, 185 p. https://doi.org/10.1007/978-1-84996-414-2
[2] László Pokorádi. Failure Probability Analysis of Bridge Structure Systems. 10th Jubilee IEEE International Symposium on Applied Computational Intelligence and Informatics, Timişoara, Romania, May 21-23, 2015. https://doi.org/10.1109/SACI.2015.7208220
[3] Albert Myers. Complex System Reliability. Springer-Verlag, London, 2010. https://doi.org/10.1007/978-1-84996-414-2
[4] Hamid Demmou, Sarhane Khalfaoui, Edwige Guilhem, Robert Valette. Critical scenarios derivation methodology for mechatronic systems. Reliability Engineering and System Safety, 84, Elsevier, pp. 33-44, 2004. https://doi.org/10.1016/j.ress.2003.11.007
[5] CS 410/510 – Software Engineering. System Dependability.
Reference: Sommerville, Software Engineering, 10th ed., Chapter 10.
[6] Fabrice Guerin, Alexis Todoskoff, Mihaela Barreau, Jean-Yves Morel, Alin Mihalache, Dumon Bernard. Reliability analysis for complex industrial real-time systems: application on an antilock brake system. IEEE International Conference on Systems, Man and Cybernetics, Hammamet, October 6-9, 2002. https://doi.org/10.1109/ICSMC.2002.1175666
[7] Cristina Johansson. On System Safety and Reliability in Early Design Phases: Cost Focused Optimization Applied on Aircraft Systems. Linköping University Electronic Press, Sweden. Thesis, ISSN 0280-7971; 1600, 2013, p. 62. URN: urn:nbn:se:liu:diva-94354
[8] Pierre-Yves Piriou. Contribution to model Based Safety Analysis for dynamic repairable reconfigurable systems. Paris-Saclay University. Thesis presented at ENS Cachan, 27/11/2015. https://tel.archives-ouvertes.fr/tel-01251556
[9] Krishna B. Misra. Handbook of Performability Engineering. Springer-Verlag, London, 2008. https://doi.org/10.1007/978-1-84800-131-2
[10] Gabriele Antonino Manno. Reliability modelling of complex systems: an adaptive transition system approach to match accuracy and efficiency. PhD Thesis, University of Catania, 2012. http://archivia.unict.it/bitstream/10761/1039/1/MNNGRL82L03C351S-PhD_Thesis_GM_A.pdf
[11] Norman B. Fuqua. The applicability of Markov analysis methods to Reliability, Maintainability, and Safety. Selected Topics in Assurance Related Technologies, Vol. 10, No. 2. Reliability Analysis Center, 2003. https://www.dsiac.org/sites/default/files/reference-documents/markov.pdf
[12] IEC 61165. Application of Markov techniques. International Electrotechnical Commission, 2006.
[13] K. A. Bateman, E. R. Cortes. Availability Modeling of FDDI Networks. Proceedings of the Annual Reliability and Maintainability Symposium, IEEE, pp. 389-395, 1989. https://doi.org/10.1109/ARMS.1989.49632
[14] L. M. Kaufman, B. W. Johnson. Embedded Digital System Reliability and Safety Analyses. NUREG/GR-0020. University of Virginia, Department of Electrical Engineering, Center for Safety-Critical Systems, Thornton Hall, Charlottesville, VA 22904, xi, 75 p., 2001.
[15] Paraskevas Stavrianidis. Reliability and Uncertainty Analysis of Hardware Failures of a Programmable Electronic System. Reliability Engineering and System Safety, Elsevier, Vol. 39, issue 3, pp. 309-324, 1993. https://doi.org/10.1016/0951-8320(93)90006-K
[16] Raphaël Schoenig. Definition of a design methodology for mechatronic systems including dependability analysis. PhD thesis, National Polytechnic Institute of Lorraine, 2004. https://tel.archives-ouvertes.fr/tel-00126057
[17] Salem Derisavi, Peter Kemper, William H. Sanders. Lumping Matrix Diagram Representations of Markov Models. International Conference on Dependable Systems and Networks, Yokohama, Japan, IEEE, pp. 742–751, 2005. https://doi.org/10.1109/DSN.2005.59
[18] Way Kuo, Xiaoyan Zhu. Relations and generalizations of importance measures in reliability. IEEE Transactions on Reliability, Vol. 61, No. 3, pp. 659–674, 2012. https://doi.org/10.1109/TR.2012.2208302
[19] Sally Beeson, John D. Andrews. Importance measures for noncoherent-system analysis. IEEE Transactions on Reliability, Vol. 52, issue 3, pp. 301–310, 2003. https://doi.org/10.1109/TR.2003.816397
[20] Elena Zaitseva, Vitaly Levashenko, Jozef Kostolny, Miroslav Kvassay. Algorithms for Definition of Minimal Cut Sets in Reliability Evaluation of Green IT System.
Department of Informatics, University of Zilina, Zilina, Slovakia. 2015. https://www.pdffiller.com/jsfiller­ desk5/?projectId=226202130&expId=3950&expBra nch=1#834b8f1bbf854c3e9f4c996e3b01e38a [21] Alain Villemeur. Dependability of industrial systems. Collection of the Direction of Studies and Research of Electricity France, ISSN 0399-4198, Volume 67, 795 pages. Eyrolles, 1988. [22] Pankaj Bansod. System Reliability and Challenges in Electronics Industry. SMTA Chapter Meeting 25th September 2013, India. https://pdfs.semanticscholar.org/presentation/64e3/b 4774be3dad7f988fb5893a1a174e6cfabfa.pdf [23] Popov Peter, Manno Gabriele. The effect of correlated failure rates on reliability of continuous time 1-out-of-2 software. International Conference on Computer Safety, Reliability, and Security, SAFECOMP 2011. Lecture Notes in Computer Science, vol. 6894, Springer, Berlin, Heidelberg, pp. 1-14, 2011. https://doi.org/10.1007/978-3-642-24270-0_1 [24] Peter Cheung Professor. Lecture5: Logic Simplification & Karnaugh Map. Department of EEE. Lecture 5 -Imperial College London. 2007. [25] Enrico Zio. Reliability engineering: Old problems and new challenges. Reliability Engineering & System Safety, Elsevier, Vol. 94(2), pp. 125–141, 2009. https://doi.org/10.1016/j.ress.2008.06.002 [26] https://www.isograph.com/software/reliability­ workbench/2013. Bio-IR-M: A Multi-Paradigm Modelling for Bio-Inspired Multi-Agent Systems Djamel Zeghida LISCO Laboratory, Department of Computer Science, 20 Aot 1955, Skikda University, P.O. Box 26 Route El Hadaeik, Skikda, 21000, Algeria dj.zeghida@gmail.com Djamel Meslati and Nora Bounour LISCO Laboratory, Department of Computer Science, Badji Mokhtar, Annaba University, P.O. Box 12, Annaba, 23000, Algeria meslati_djamel@yahoo.com, nora_bounour@yahoo.fr Keywords: bio-inspired system, agent-oriented software engineering, infuence/reaction principle, biomorphic system Received: January 31, 2017 Nowadays bio-inspired approaches are widely used. Some of them became paradigms in many domains, such as Ant Colony Optimization (ACO) and Genetic Algorithms (GA). Despite the inherent challenges of surviving, in the natural world, biological organisms evolve, self-organize and self-repair with only local knowledge and without any centralized control. The analogy between biological systems and Multi-Agent Systems (MAS) is more than evident. In fact, every entity in real and natural systems is easily identified as an agent. Therefore, it will be more efficient to model them with agents. In a simulation context, MAS has been used to mimic behavioural, functional or structural features of biological systems. In a general context, bio-inspired systems are carried out with ad hoc design models or with a one target feature MAS model. Consequently, these works suffer from two weaknesses. The first is the use of dedicated models for restrictive purposes (such as academic projects). The second one is the lack of a design model. In this paper, our contribution aims to propose a generic multi-paradigms model for bio-inspired systems. This model is agent-based and will integrate different bio-inspired paradigms with respect of their concepts. We investigate to which extent is it possible to preserve the main characteristics of both natural and artificial systems. Therefore, we introduce the influence/reaction principle to deal with these bio-inspired multi-agent systems. 
Povzetek: Avtorji prispevka analizirajo podobnosti med biološkimi in multiagentnimi sistemi in predlagajo Bio-IR-M, integrirano shemo, ki zajema tako genetske algoritme kot npr. modele, temelječe na mravljah. Introduction In computer science, bio-inspired approaches are getting a particular interest. Their mechanisms and their behavioural, functional or structural features remain favourable felds of study and inspiration for multidisciplinary researches. Therefore, most researchers agree that both natural and bio-inspired systems are complex. In each system distribution and decentralization are inherent features. We see now a large emergence of bio-inspired systems. These systems, inspired from nature and living organisms, extract metaphors for solving complex problems, getting new dimensions for systems we design. Some of these bio-inspired approaches became paradigms in many domains such as in hard optimization as heuristics [16], highlighting by the way the ACO meta-heuristic of Dorigo [15]. We can fnd early examples as use cases for instance in optimization with an evolutionary approach [23] or with a swarm intelligence (SI) using Ant Colony (AC) [10]. Other examples are presented for the use of Artifcial Neural Network (ANN) in control and decision systems [35] or for object-class detection (specifcally face detection) [53]. While [58, 59] present respectively the use of Artifcial Immune Systems (AIS) and AC in the security domain. Although, [63] presents the use of AC intelligence with agent for scheduling and [4] illustrates routing with GA. Some recent applications can be found, such as a parallel extended algorithm for the Ant Colony algorithm in [27] and Particle Swarm Optimization (PSO) algorithm in [37]. We can cite two others applications of the ACO Meta-heuristic for resource discovery in a grid using the agent technology [46] and for home automation networks [60]. A mobile agent Ant Algorithm (AA) has been used in an Ant-Based Cyber Defence system [21], when a hybrid Ant-Bee Algorithm was used for multi-robot coverage in [7]. For multi-objective optimization, we found the use of a Bat Algorithm (BA) in [2] and an evolutionary algorithm in [49]. In vision, Artifcial Neural Networks (ANN) are used for place recognition [11]. This proliferation is mostly due to technological and methodological advances in application areas and a better understanding of biological natural mechanisms. Historically, the evolution of any approach or paradigm must be accompanied by a methodological evolution to carry the design side. Therefore, the need for an associated and specifc bio-inspired modelling is becoming increasingly urgent. Such unifed abstract representation will, at least, help overcome the lack of reuse in this domain. A straight analogy can be easily identifed between natural and multi-agent systems. Formally, we clearly distinguish two levels as follows: -Micro level: held by the Agent concept in MAS and by an individual in natural system. An agent and an individual are both: autonomous, reactive, proactive and social. -Macro level: referring to the MAS concept (an aggregate of interacting agents) and to a sub­system or to the entire natural system. 
In both systems we can fnd a set of features such as: diversity and distribution of knowledge, decentralization of data, distributed control, asynchronous calculations and processing, e.ciency of parallel treatments, robustness, fault tolerance and dependability, fexibility, sophisticated plans of interaction (cooperation, coordination and negotiation), asynchronous local communication and emergent functionalities. In this paper we focus on the modelling issue. We show the interest of a dedicated multi-paradigm model for bio-inspired multi-agent systems. In fact, by exploiting the evident analogy between biological and multi-agent systems and highlighting the fact that these agent/multi-agent concepts are a common denominator for bio-inspired paradigms; it is quite natural to model these systems using autonomous agents. With regard to this perspective we suggest a unifying and generic infuence/reaction agent model for several bio-inspired paradigms. In Section 2, we give the background used in this paper that presents natural/multi-agent systems and the infuence/reaction principle. Section 3 gives some refections and analysis on Agent/Actor/Object concepts and the micro/macro levels in both MAS and bio-inspired paradigms, and then we show the convenience of using the agent concept as a generic model. All this help to position our contribution. Section 4 presents the concept of bio-inspired design. Throughout Section 5, we focus on the details of our proposed generic infuence/reaction agent model, which is based on an explicit environment model and a separate interaction module. Section 6 presents some case studies. We discuss related works in Section 7, and Section 8 concludes the paper. 2 Background This section provides the basic concepts and features of natural and multi-agent systems, highlighting the infuence/reaction principle. 2.1 Specifcities of natural systems If we consider any ecosystem or biotope we can see that several autonomous species cohabit together with various complex interactions and interdependencies. Biologists defne the biotope as a small box with a separate set of environmental conditions (climatic and geological) that supports an ecological community composed of plants and animals. In a biotope, interdependence is complex and species survival depends on it. It is important to notice that all the biotope forms a coherent system and that various species cohabit while they di.er greatly in terms of mechanisms and behaviours. These species require continuous changes of organization: decomposition/aggregation to face these very constraining and changing environments (Figure 1). Figure 1: Canonical view of a complex natural system [30]. Note that distribution and complexity are innate features of these systems rather than casual. These systems are Auto organized Group of individuals. These last are Autonomous, Simple and Cooperative, put together in local communication to perform Complex operations in a Distributed and Parallel manner. Where the behaviour shown by the group is not explicitly programmed in the members but emerges from their interactions. These members join and leave freely the group in continual change. All this is performed without any central control [36]. With all this chaos and anarchic interactions, the organization continues to grow, to live, to adapt and repair itself. 2.2 Multi-agent systems The multi-agent systems are based on the distribution of knowledge and control, spread over a set of entities called agents. 
MAS are a metaphor of social organization [9]. Agent technology comes from several felds: artifcial intelligence, software engineering and human machine interfaces. According to J. Ferber [19],”an agent is an autonomous entity, real or abstract, which can act on itself and its environment, which, in a multi-agent universe can communicate with other agents, and whose behaviour is a consequence of its observations, knowledge and interactions with other agents”. An agent is mainly [29, 62]: - Autonomous: its behaviour is guided by objectives; it has an internal state on which it holds total control. This internal state is particularly inaccessible to other agents. Furthermore, the agent makes decisions that are based on this internal state without external intervention (human or other agent). - Reactive: an agent is situated in an environment. It is able to perceive this environment and respond to events in it by its actions. - Social: An agent is a social entity in the sense that it is able to interact and communicate with other agents through its environment. - Proactive: an agent does not just react to its environment, but it is also able to produce self- actions motivated by its own goals (agent takes initiative). An agent may be: reactive, cognitive or hybrid. MAS based on reactive agents are characterized by a large number of simple agents, by emergence and eco-resolution. MAS based on cognitive agents are characterized by a small number of intelligent agents, by coordination, negotiation and cooperation. In this case, the system depends on the agents’ intelligence. When multi-agent systems are based on reactive agents (not intelligent), they depend on the agents’ interactions to get intelligent collective behaviour. It defines a particular kind of Distributed Artificial Intelligence called Swarm Intelligence (SI). In such systems, intelligent functionalities (which haven’t been explicitly coded in the system) can emerge throughout the agents' interactions. MAS are usually characterized by: -Diversity and distribution of knowledge: each agent has information and limited problem solving abilities (incomplete information and limited scope of action), and each agent has a partial view of the system, -Decentralization of data, -Asynchronous calculations and processing, -Distributed control: there is no overall control of the system, -E.ciency of treatments: the agents work in parallel and communicate asynchronously, -Robustness, fault tolerance and dependability: the disconnections of some agents do not substantially a.ect the overall behaviour of the system, -Flexibility: we can always increase (or decrease) the number of agents to treat larger and larger systems, without disturbing the work of existing agents who can adapt themselves, -Sophisticated plans of interaction: they include cooperation, coordination and negotiation, Informatica 42 (2018) 451–466 453 -Ideal for representing problems with multiple solution methods, multiple perspectives and/or multiple solvers. They have the traditional advantages of distributed and concurrent resolution of problems such as modularity, speed (with parallelism) and reliability (due to redundancy). 2.3 The infuence/reaction principle Besides being solution for simultaneity, the Infuence/Reaction principle provides bases of good agent modelling/programming [41, 42] to accomplish more formally some aspects of the agent paradigm. 
As a modelling principle, the Influence/Reaction principle was defined for its ability to model concurrent behaviour, but its interest goes beyond this objective. First, it gives a true semantics to the management of interactions during the reaction phase (through influences). It also avoids representing an action as a direct change in the global state of a system. This model can provide truly autonomous agents, requiring a clear distinction between the state variables of the agent's decisional system (its mind) and the variables relating to its physical appearance, which are part of the environment (its body). The mind's variables are accessed and modified only by the agent, and only during the Influence phase, whereas the body's variables can be changed only during the Reaction phase, by the environment [41, 42].

2.3.1 The influence/reaction principle for modelling simultaneous actions
Given the autonomous nature of these entities, the simultaneity of actions is an inherent characteristic of the agent paradigm which is, in addition, difficult to implement adequately. Agents must not have control over the consequences of their actions; only the environment has the ability to compute them, and the internal structure of an agent remains unreachable to it. The influence/reaction principle is a solution for modelling simultaneous actions [17, 41, 42]. This principle can be summarized in two points: 1. agents do not have direct control over the result of their actions; 2. all the influences produced at a given moment must be known in order to compute the new state of the world. Every application of this principle provides a model for its implementation.

2.3.2 The influence/reaction principle for modelling interactions
In Figure 2, let us denote by σ(t) the dynamic state of the system at time t and by γ1, γ2 two influences produced at this time. The new state σ(t + dt) is given by the reaction function (equation 1):

σ(t + dt) = Reaction(σ(t), γ1, γ2)    (1)

The parallel character of actions can be mandatory or optional depending on the situation (see [41, 42] for more details). In the mandatory parallel case, we have parallel reactions, requiring an explicit composition of behaviours. To preserve the coherence of the system and to ensure the decisional autonomy of all involved agents, we calculate the reaction of the environment by treating all the influences simultaneously, as a single unit (equation 2):

σ(t + dt) = Reaction(σ(t), {γ1, γ2})    (2)

Figure 2: Illustration of the Influence/Reaction principle [41].

In the second case, the parallel character is no longer an obligation (it is just a modelling choice) and we have serial (non-parallel) reactions. Neither the coherence of the system nor the agents' autonomy is compromised by the process used; we can use equation 2, or we can decompose the overall computation into elementary and independent reactions and execute them in sequence, one after another (equation 3 and then equation 4). We first calculate:

σ′ = Reaction(σ(t), γ1)    (3)

and then:

σ(t + dt) = Reaction(σ′, γ2)    (4)

(or γ2 and then γ1). We conclude here that the use of an Influence/Reaction model in the treatment of interactions calls for a separate interaction module.
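As an illustration of the two-phase cycle of Section 2.3, the sketch below separates influence production (by the agents) from reaction computation (by the environment). The class names, the one-dimensional world and the composition rule are assumptions made only for this example, not part of the model proposed later in the paper.

```python
# Minimal influence/reaction sketch: agents only emit influences; the environment alone
# composes them and computes the next state (the reaction of equation 2).
import random

class Agent:
    def __init__(self, name):
        self.name = name                       # "mind" variables stay inside the agent

    def influence(self, perceived_positions):
        # An influence is a wish ("move left/right"), not a direct change of the world.
        return (self.name, random.choice([-1, +1]))

class Environment:
    def __init__(self, agents):
        self.positions = {a.name: 0 for a in agents}   # the agents' "bodies"

    def react(self, influences):
        # Reaction phase: all simultaneous influences are treated as a single unit;
        # the composition rule used here (apply them all) is an assumption.
        for name, move in influences:
            self.positions[name] += move

agents = [Agent("a1"), Agent("a2")]
env = Environment(agents)
for _ in range(3):
    influences = [a.influence(env.positions) for a in agents]   # influence phase
    env.react(influences)                                       # reaction phase
print(env.positions)
```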
3 Analysis and reflections
We note that the term approach refers to a vision or a process for dealing with an issue; we call it a paradigm when it is well defined and widely used (for instance, agent and object are both paradigms in many domains; when we qualify them as approaches, we mean the global vision and the way they proceed).

3.1 The challenge
Given the multitude and variety of bio-inspired paradigms available today (Table 1), it would be interesting to seek a unified approach for their design. In Artificial Intelligence, to think "bio" is sometimes to think "multi-agent system", and to think MAS is to think modelling and simulation. This transitivity makes MAS a natural bridge between the real world and modelling and simulation in data processing. That is a generalisation of what was attested for immunology by Bakhouya [3]. So, for biology and MAS, the support is mutual. Biology supports MAS in particular, and the field of computer science in general, by providing artificial systems with principles, processes and mechanisms available in biological systems. This is achieved through biological metaphors, i.e., analogies established between the biological world and the artificial world, in order to propose approaches mimicking some aspects of the natural world while ignoring others. A historical overview of bio-inspired approaches can be found in [36]. Basically, the metaphors do not try to reproduce what is biological, but rather to interpret it in terms of what it is possible and reasonable to do. Thus, we can conclude that biological metaphors are evolving and depend on our understanding of reality and on our ability to extract beneficial and practical elements.

Table 1: Description of some bio-inspired paradigms.
Paradigm | Metaphor | Inspiration's nature
Artificial Neural Network (ANN) | Brain structure & functioning | Structural & functional
Genetic Algorithm (GA) | Genetic mechanisms | Functional
Fuzzy System (FS) | Human reasoning | Functional
Artificial Immune System (AIS) | Operating & organisational mechanisms of immune cells | Structural & functional
Ant Colony Optimization (ACO) | Ant colony behaviour | Behavioural
Particle Swarm Optimization (PSO) | Behaviour of a swarm of birds in flight | Behavioural

On the other side, MAS allow the construction and design of complex systems that are highly distributed and adaptable to environmental changes. MAS offer biologists the ability to model and simulate, as simply as possible, complex natural systems (interacting cells/molecules, insects, birds, fish or other living organisms), providing a reproduction of natural phenomena through computers in order to: – understand their processes/mechanisms; – identify new metaphors: computation/memorisation models or resolution/optimization tools. We have to notice that natural systems are by definition Open Systems, and so must artificial (bio-inspired) systems be. Besides their innate characteristics (Section 2.1), an Open System must have the three following characteristics: 1. The number of the system's components can change; the system accepts new components and allows the departure of existing ones. 2. The system's organizational structure can change; there is no predefined and fixed organization to respect, and components can form and dissolve aggregations and groups freely. 3. The two previous characteristics must be achievable while the system is running (in action). The first two characteristics are enough in nature to define an Open System. The third characteristic can be ignored in living organisms and "organizations", because it is naturally verified: the ecosystem is not constrained to stop, or even to wait, for the changes of its structure and of the number of its components. In the artificial world (such as in computer science), the third characteristic is very important.
We can change the structure and the number of a system’s components by modifying its code when it is stopped; in this case the system is not Open. To be Open, the two previous changes must be observed within a running system (system in execution). Agent-Oriented Software Engineering (AOSE) has evolved to include the following high-level themes: methodologies, architectures, framework implementations, programming languages, and communication (Figure 3). Our contribution aims to address the modelling issue in the Agent Oriented Methodologies theme. Figure 3: Agent-Oriented Software Engineering thematic map [55]. Mainly, a design methodology will include: 1. Models: Abstract representations of the real world or a part of it; 2. Tools: Means to represent, to manipulate and to implement the models; 3. Process: Coordinated set of steps, phases and tasks showing the path to achieve the system design. For a precise positioning of our contribution, we summarize in Figure 4 and Figure 5, what has been Informatica 42 (2018) 451–466 455 already done in particular computer science felds and what remains to be done. Figure 4 depicts the combined/separated use of bio-inspired approaches and Agent/multi-agent concepts in the feld of Distributed Artifcial Intelligence (DAI) or traditional Artifcial Intelligence (AI). The case (a1), illustrates the use of bio-inspired resolution/optimization tools (Algorithms: computation / memorisation models or resolution / optimization tools) to solve problems. All examples and applications cited in Section 1 belong to this case (except where it has been mentioned the use of agent). The case (b1), illustrates the use of bio-inspired Agent/multi-agent modelling/simulation tools (Platforms) to model/simulate bio-inspired multi-agent systems. For instance, the use of Turtlekit tool in Madkit platform [26] for simulating artifcial life/reactive systems and the use of Repast platform for simulating social science applications [22]. It can illustrate, too, the use of bio-inspired Agent/multi-agent tools and models (Algorithms) such in [7, 21, 46, 59, 63]. The case (c1), illustrates the use of Agent/multi-agent modelling/simulation tools (Platforms) to model and simulate multi-agent systems. Gama, NetLogo and PRESAGE2 are examples of still used agent simulation platforms [22]. Figure 4: Bio-inspired approaches’ use in AI/DAI feld combined or not with MAS. Figure 5 depicts the combined/separated use of bio-inspired approaches and Agent/multi-agent concepts in the feld of Software Engineering (SE). The case (a2), illustrates the use of bio-inspired Ad hoc methodologies (models/process/tools) to develop bio-inspired systems. It concerns most of developed bio-inspired systems. Figure 5: Bio-inspired approaches’ use in SE feld combined or not with MAS. The case (b2), illustrates the use of bio-inspired agent/multi-agent methodology (models/process/tools) to develop bio-inspired multi-agent systems. There is no model nor methodology to deal with this case [24, 25, 34, 44, 50, 51]. Otherwise, we can fnd, only, methodology supporting a one target feature (for example Adelfe agent methodology supports emergent functionalities) [6]. It is the case that needs improvement, and where we aim to contribute in this paper. The case (c2), illustrates the use of Agent/multi-agent methodology (models/process/tools) to develop multi-agent systems. For instance, we can cite the AGR and AGRE organisational models [18, 20]. 
For the methodologies we have, for instance: Gaia, MaSE, O-MaSE, Passi, Prometheus, INGENIAS, Tropos [22, 47, 56]. Some examples of their application can be found in [38, 40, 54], while others do not mention any methodology at all [33, 52, 63].

3.2 Agent versus Object and Actor
As a modelling concept, the less well-known concept of Actor was launched to overcome the passive nature of the Object. In Table 2, we situate the Agent with regard to the well-known and widely used concept of Object. The Actor concept is a mathematical model of concurrent computation used for several practical implementations of distributed systems. It was built with one main added value: its asynchronous behaviour (Figure 6). The Actor concept, initiated around 1973, was left out and ignored for decades; it was relaunched first by Gul [1] and later by Karmani and Gul [31, 32]. The Agent concept overtakes the Actor with its skills in interactivity (Figure 6). The three concepts became paradigms in the computer science domain.

Table 2: Comparing the Object & Agent approaches.
Comparison criteria | Object approach | Agent approach
Nature | Passive | Active and autonomous
State/behaviour realization | Encapsulate | Encapsulate
Behaviour activation | Don't encapsulate | Encapsulate
Generic system functions | Focus | Neglect
Describing interaction types | Primitive mechanisms | Advanced mechanisms
Patterns of interaction | Rigid and mandated | Flexible and sophisticated
Means of abstraction | Insufficient | Sufficient
Specifying and managing organizational relationships | Minimal support (static inheritance hierarchies) | Advanced support
Modelling complex systems | Not supported | Supported by concepts/mechanisms

If we take the most important and illustrative features, intelligence and intermediation, Figure 6 depicts the places of the three paradigms Agent, Actor and Object together. This figure was inspired by a graphic description of agent types and functionalities; it was later refined in [66] and is extended here to Actor and Object.

Figure 6: Positioning the Agent, Actor and Object concepts according to the intelligence and intermediation features.

On the intelligence axis, all three paradigms can deal with this feature more or less easily. On the other axis, we distinguish an inclusion relationship (Figure 7). Indeed, Objects cannot even deal with the first step, asynchronism, which is well handled by Actors. The Agent reaches farther steps, with its sophisticated means of communication, preferably named agent interaction. In agent interaction, we distinguish an indirect mode, used only for limited coordination (pheromones in ant colonies), and a direct mode. The latter is widely used, ranging from agent languages (KQML, the Knowledge Query and Manipulation Language, and ACL-FIPA, the Agent Communication Language proposed by the Foundation for Intelligent Physical Agents) to ontologies and a communication support (present in agent platforms such as JADE [5] or MadKit [26]). The direct mode is structured using protocols, dialogue games or argumentation systems [28].

Figure 7: Object, Actor and Agent inclusion.
Micro level (Agent) -Artifcial Neural Network (ANN) -Genetic Algorithm (GA) -Fuzzy System (FS) Macro level (MAS) -Artifcial Immune System (AIS) -Ant Colony Optimization (ACO) -Particle Swarm Optimization (PSO) Table 3: Classifcation of bio-inspired paradigms. Note that for a particular use and specifc abstraction need, we can use a micro level as a macro according to Table 4. For instance, with a functional metaphor (Table 4), GA was classed in the micro level (Table 3), but with deeper abstraction level it can be used in a macro level, where every genotype, for instance, will be hold by an agent. Nature of the metaphor Micro level (Agent) Macro level (MAS) Functional Ok Structural Ok Behavioural Ok Table 4: Classifcation of bio-inspired paradigms according to their metaphor’s nature. 3.4 The unifying formalism The idea of using a unifying formalism to deal with the diversity of specifc concepts to the considered paradigms became more obvious. Rather than proposing an approach that is the sum of the various concepts, or try to merge similar concepts, our vision of a unifying formalism is to wrap the various concepts by basic concepts and to operate, thereafter, a successive refnements that can be conducted in the specifc contexts to each bio-inspired paradigm. 3.4.1 Adequacy of the agent approach for the development of natural systems The multi-agent systems beneft from the e.ort of a wide scientifc community relying on the fact that their approach adapts to various levels of abstraction. Indeed, from cognitive complex agents to very simple reactive agents, it is possible to model very di.erent realities. In [48], criteria that characterize bio-inspired MAS approach were proposed (Table 5). Some of these characteristics refer to the micro level, which is the individual component (agent level) and others to their aggregate (multi-agent level). Criteria Nature Agents must correspond to entities and not to abstract functions. Micro level: Agent Agents should be small in size (system’s parts), in time (able to forget) and in scope (avoid global knowledge/actions). Micro level: Agent Agents’ community should be decentralized, with no single point of control or failure. Macro level: MAS Agents must be diverse. Randomness and repulsion are important tools for the maintenance and stabilization of this diversity. Micro level: Agent Agents’ community should include mechanisms for disseminating information to increase its agents’ reactivity. Macro level: MAS Agents must have means to capture and share what they know/learn. Micro level: Agent Agents plan and run in concurrent and parallel way. Micro level: Agent Table 5: Characteristics of bio-inspired multi-agent approach. Many arguments have been given in favour of the use of agent-oriented approaches for the design of complex natural systems [30]. The role of engineering software is to provide the structures and techniques that facilitate the management of their complexity. It is in this perspective that software engineers have developed a number of fundamental tools in the feld, referring to decomposition, abstraction and organization. Let us see the contributions of agent approach for each point [30]. 1. Advantage of agent-oriented decompositions: Limiting the scope and extent of the designer, the decomposition is the basic technique that helps to counter big problems and their complexity, by dividing them into smaller parts, manageable and treatable in a relatively separated way. 
It is apparent that the natural way to model a complex system is based on several independent components that can act and interact in a fexible way to achieve their objectives. The agent-oriented approach seems to be the best choice. 2. The convenience of agent-oriented abstractions: Limiting, at a given time, interest and visual feld of the designer, the process of defning a simplifed model of the system, helps to overcome its complexity, by focusing on some details and ignoring others. In the case of complex systems composed of subsystems, components of subsystems and organizational relationships, it is natural to match the sub-systems to agent organizations, the components of subsystems to agents and interaction between subsystems and between their components will be viewed in terms of high-level social interactions. 3. The need for fexible management of changing organizational relationships:O.ering the ability to specify and adopt organizational relation-ships, the process of defning and managing interactions between di.erent components of problem solving (sub-systems and interaction links), helps designers to deal with complexity by allowing the grouping of components, to treat them as a unit of high-level analysis and to provide means for describing high-level relationships between various units. Agent-oriented systems have mechanisms for concurrent computing to form, maintain and dissolve organizations fexibly. The multi-agent systems became a new technology for the design and control of complex, fexible and scalable systems. 3.4.2 The environment in bio-inspired multi-agent systems In AEIO Vowels model [12]; Da Silva distinguishes four dimensions for MAS: Agent, Environment, Interaction and Organization. We notice that the environment component has been identified as a key element for MAS [61]. For bio-inspired systems, this component is of vital importance. This is the place where agents must co-exist and interact with the ability to form, maintain and dissolve organizations. All this changes can take place only through the environment [61]. Parunak [48] emphasizes a real consideration of the environment for ”natural” MAS. In this context, he establishes that such system can be defined as three components: MAS ={Agents, Environment, Coupling} Where an Agenti is a set of four elements as follows: Agenti ={A.statei, A.inputi, A.outputi, A.processi } D. Zeghida et al. The Environmenti (as a scoop of Agenti) is composed by two elements: Environmenti =< E.statei, E.processi > The exact nature of the Coupling depends on how we model agents and environment states and process. This coupling can be very complex. When agents and environment are discrete events, the Coupling of the A.inputi and A.outputi to E.statei is simply a mapping of agent and environment states. This kind of representations, dominating in the artificial intelligence area, is criticized because it generates unrealistic situations. A solution proposed for this is: the infuence/reaction principle [17, 41, 42]. Obviously the autonomous of entities and simultaneity of their actions is crucial for natural MAS. So a direct validation of actions is to be avoided in such approaches. In respect of these requirements, we propose the use of the infuence/reaction principle to deal with bio-inspired multi-agent systems. 4 Biomorphic systems Nowadays we often speak about bio-inspired or biomorphic systems. Let us see their appropriate signifcations. 
4.1 Origins The biomorphic (biology-morphology) term was coined by the British zoologist Desmond Morris to describe the bio-inspired software approach [36]. Let us recall that a biological metaphor is an analogy sought to be determined between artifcial and biological worlds, in order to provide tools which mimic some aspects of real world. The result of such process is a bio-inspired system. A biomorphic system is simply designed based on algorithmic concepts inspired from biological systems and processes: (Biomorphic = Bio-inspiration + Design). Consequently when we speak about development, design or modelling, we precisely use the term bio-inspired instead of biomorphic which include implicitly a process and structure. 4.2 Premises of a bio-inspired design The premises of any development process of biomorphic systems fall into two points: 4.2.1 Characterization of bio-inspired design We had to identify the core processes and to formally describe their computational model. Since there are many paradigms, it is important to distinguish the basic paradigms and hybrid/composed ones. Lodding [36] explains that a biomorphic system is the result of a bio-inspired design for a given system. It is designed based on concepts inspired from biological systems and processes. However, it is not easy to identify structural features for stating that a given architecture is bio-inspired. To address this issue, several criteria have been identifed to characterize the behaviour of biomorphic systems [36]. These criteria emphasize that a biomorphic system is materialized by a multitude of autonomous entities that collaborate. Table 6 depicts them and suggests their nature. Criteria Nature The system behaviour results from the collective interaction of several independent and similar entities. Macro level: MAS The system behaviour emerges from the interaction of entities without being explicitly described in them. Micro level: Agent Entities act autonomously. Macro level: MAS The entities are operating based on local information and interactions and their spatial scope is rather local. Micro level: Agent The entities appear and disappear freely according to the system changes (free evolution of the group). Micro level: Agent The entities are able of self-adapt and adjust to changing objectives, knowledge and conditions. Micro level: Agent The entities have the ability to evolve over time. Micro level: Agent Table 6: Characteristics of a bio-inspired design. As said for the Parunak’s characteristics (Table 5) these characteristics can be classified in two categories; atomic characteristics; referring to individuals and composed one, referring to a group of individuals (their aggregate). 4.2.2 Characterization of the context of applicability The context of applicability, of each basic bio-inspired paradigm, help to reach a state where knowing specifc criteria on a given problem, it will be possible to choose the bio-inspired paradigm to apply or indicate possible combinations (that suggests a multi-paradigm approach). 4.3 Consequences of a bio-inspired design Based on the previous two premises, when we are interested in some way by a bio-inspired multi-paradigm development approach, it should be noticed that biomorphic aspect concerns the whole life cycle of a software system. On requirements phase which is supposed to deliver the system functional and non-functional requirements, a preliminary determination of bio-inspired paradigm to use for each requirement or group of requirements is necessary. 
At this level we can, for example, determine that a particular requirement has characteristics that suggest the use of ant colony optimization or using a neural network classifcation. Determining the appropriate bio-inspired paradigm for a given requirement is closely linked to the premises previously introduced. The design phase is a key phase. In architectural design, this phase allows to decompose the system into subsystems and to determine the role played by each one and interactions that must exist between the subsystems. For this, we must frst determine the main bio-inspired paradigm to use according to the main system requirements. Based on these requirements, it is possible that none of the basic bio-inspired paradigm matches and, at that time, it would be advisable to consider combinations (hybridization). The second step in design is the detailed design. If a subsystem must comply with a bio-inspired paradigm given its detailed design should specify inputs and outputs and the necessary adjustments to implement this paradigm. 4.4 The need for a multi-paradigm approach Natural systems are by definition typically complex. This complexity is not only due to the multitude of entities that form their operational system, but also to the diverse nature of these entities and the varied interactions they may have. It is su.cient for realizing it to consider an operating system and the various devices it manages, an Intranet and nested protocols which keep it operational, or an air or rail tra.c management system. 4.4.1 Analogy with artifcial systems From an organizational point of view and having in mind the image of a biotope, an artifcial system may be composed of interdependent subsystems where each is governed by a biological metaphor, provided by a given paradigm. The underlying interest in this approach is to take advantage of the best paradigms for each problem. So, it is a synergy of the various paradigms that we want to achieve. In turn, the subsystems can be decomposed and everyone will operate within a given paradigm. The relationship itself between the various sub-systems may be governed by a di.erent paradigm from those governing the subsystems. By analogy with the biotope where the objective is to maintain equilibrium between individuals, species and environment, the objective which we assign to a multi-paradigm approach is to provide a system with performance relatively best and good quality (reliability, development facility, maintainability, portability, etc.). 4.4.2 Rules of application of a multi-paradigm approach This vision of complex systems raises remarks to be mentioned: 1. The multi-paradigm approach is simply a further bio-inspiration that makes the analogy between an artifcial system and a biotope. It is not limited by a single metaphor but by many. 2. The multi-paradigm approach is a systemic approach that aims to integrate or hybridize the paradigms to take advantage of their synergy. For example, a system can be modelled as an ant’s colony that uses genetic algorithms as a computational model. 3. In absolute, no paradigm dominates the other, but, a paradigm may be at the forefront in a context and second plane in another. For example, a system can be modelled as an evolving species (applying an evolutionary approach) where individuals are neural networks for which we try to improve the confguration or the synaptic weights. The opposite is also possible; for example a neural network where each node computes its combination function by a genetic algorithm. 4. 
Paradigms can be used in a re-entrant manner; for example, a neural network whose outputs are used to select another network among several neural networks.
To finish this section, note that the persistence and coexistence of various programming languages is a fact that illustrates the practical relevance of a multi-paradigm approach (as in the case of Microsoft's .NET platform, which is independent of any particular programming language and natively supports a large number of them). The next section focuses on the modelling issue as part of a multi-paradigm bio-inspired approach.
5 The Bio-IR Modelling
In the context of a bio-inspired design, our goal is to use a generic model to unify the diversity of concepts specific to the considered bio-inspired paradigms. A recapitulative reflection and analysis can be performed on what was presented in the previous sections. Indeed, besides the fact that MAS, like natural systems, consider that systems are composed of interacting entities, there is a great similarity in the criteria for characterizing bio-inspired and MAS approaches (Table 5 and Table 6). It is possible to classify these characteristics into two categories: intra-entity and inter-entity characteristics. In other words, we characterize the entities taken separately (atomic; referring to individuals) just as we characterize their interactions (composed; referring to an aggregate of individuals). We notice that the same observation was made for the classification of bio-inspired paradigms (Table 3). For these reasons, we believe that the multi-agent systems approach is naturally placed as a prime candidate to act as a unifying modelling framework for biomorphic systems. Figure 8 describes the meta-model of the general case of a multi-paradigm bio-inspired multi-agent system with a biomorphic agent and a biomorphic group. We notice that it includes the six bio-inspired paradigms cited in this paper. For a new bio-inspired paradigm, we have to classify it at the micro or macro level. We must follow the recommendations of Table 4, according to the bio-inspired metaphor's nature, its particular use and the needed abstraction level. If it belongs to the macro level, we add it as a specialisation of the group (inheritance). Otherwise it is added as a specialisation of the agent (being at the micro level).
Figure 8: Meta-model for a multi-paradigm bio-inspired multi-agent system.
The complex nature of biomorphic systems is exhibited by different aspects, ranging from simple computation and optimization through complex coordination and symbolic resolution. Using MAS to address these issues in a multi-paradigm context, we identify three possible scenarios:
1. Intra-agent approach: the agent encapsulates a processing according to a given bio-inspired paradigm (as a computational model, for instance). The system is seen as an aggregation of biomorphic agents. This scenario has the advantage of encapsulating the diversity of paradigms in agents, which is interesting in terms of development: work division between teams (this is the case of a modelling with only bio-inspired agents and without bio-inspired groups, Figure 9);
Figure 9: Meta-model for a bio-inspired agent.
2. Inter-agent approach: the bio-inspired aspect appears through the interactions of agents (i.e. the MAS), and we converge to a bio-inspired group behaviour with non-bio-inspired agents (Figure 10);
3. Hybrid approach: the previous two scenarios are combined.
The system is then seen as a biomorphic group of biomorphic agents (the case of a modelling with bio-inspired groups and bio-inspired agents, (Figure 11)). We notice that, in our model, there are no constraints on the type/architecture of the agent. In the micro level, the agent will be cognitive according to the bio-inspired approach that it holds. In this case, its Computation module must be, consequently, sophisticated. In a macro level the agent is generally reactive. Formally and at a higher level of abstraction, in biomorphic MAS the three previous cases will be refected in two levels as follows: -Agent level We use an agent model which must support the biological dimension; it will be designed by ensuring real autonomy with the separation between the state variables of the decisional system (the mind) and the physical component (the body). These interacting agents can be structured in groups (Figure 12.a). -Group level The resulting system is an aggregate of interacting agents. These interactions will be managed by a separate interaction module. We emphasize the active character of the environment to be modelled explicitly. This feature is because it has its own process that can change its state, regardless of the actions of its agents. The states of various agents are coupled to the state of the environment. This coupling will be performed using the infuence/reaction principle. We model a bio-inspired infuence/reaction multi-agent system as follows (Figure 12): Bio.IR.M = {{Bio.IR.A}, Bio.IR.E, Bio.IR.C}, Where: 1. Bio-IR-A; the Agent component: An agent does not have a direct control over the result of its infuences on the environment, including on its physical component state variables. The agent has to emit infuences to the interaction module. But in the opposite, the agent can use and modify its decisional system state variables, its physical component state variables can be changed by an external component (as a reaction to the environment component for instance) (Figure 12.a). 2. Bio-IR-C; the Coupling component: The coupling module manages interactions by composing the agent/environment infuences which are simultaneous and then forward the result to the environment/agent component (Figure 12.b). 3. Bio-IR-E; the Environment component: as an active component, the environment re-acts (by its own infuence) to the agents’ infuences based on its own process and state. The environment can not only use and modify its state variables but also change the agent physicapl component state variables through the coupling module (Figure 12.c). However, the environment cannot reach the agent decisional system variables. The outgoing arrows from a database are read access, the incoming ones are updates. This model can preserve the integrity of our agents by separating their state variables. Decisional system variables are accessed / modified only by the agent during the infuence phase. The physical component variables are part of the environment and are modifed only by the environment during the reaction phase. The reaction of agent/environment is in our case an infuence wished to be performed on the environment/agent and it is not, any more, a traditional action, in the artifcial intelligence sense. Even if the infuence/reaction principle does not a.ect the simultaneous action and the interaction modelling, this principle improves the information dissemination mechanisms to increase the system’s reactivity. To this end, we summarize the main characteristics of our proposal in: 1. 
The application of the influence/reaction principle:
o The ability to model concurrent and joint behaviours.
o Abandoning the representation of an action as a modification of the system's global state.
o Improved mechanisms for disseminating information, to increase agent reactivity.
2. The isolation of an interaction module (the coupling module), which uses all the influences produced at a given moment to compute the new state of the world.
3. The guarantee of agent integrity (autonomy) through the distinction between the decisional-system state variables of an agent and the variables concerning its physical aspect.
4. The explicit modelling of the environment.
6 Application case studies
We take, as a first case study, the use of an Ant Algorithm (the Ant Colony Optimization meta-heuristic) applied to the famous Travelling Salesman Problem (TSP). Figure 13 illustrates the modelling of a TSP Ant System according to our model and using an adapted AGRE organizational model [20]: a special consideration for the environment and a double circle for the bio-inspired aspect. In this case we have a macro-level bio-inspiration represented by a biomorphic group "Validation", implementing the ACO approach to find the shortest circuit of towns. The ant agents in this implementation use a probability depending on the distance and the pheromone density on every path between towns to choose the next town to move to (the corresponding meta-model is given in Figure 14). Figure 15 shows the modelling of a TSP Ant System with a macro-level bio-inspiration, a biomorphic group "Validation" implementing the ACO approach, and a micro-level bio-inspiration, a biomorphic agent "Ant" using, for instance, a Genetic Algorithm as its computational model to choose the next town to move to (its meta-model is presented in Figure 16). In both cases the coupling is performed with the influence/reaction principle. The environment can be seen as a graph, where nodes are towns and arcs/weights are paths/distances between towns. An implementation of the first case on the JADE platform can be found in [67], comparing the three basic Ant System variants: Ant-Cycle, Ant-Density and Ant-Quantity [13, 14]. The obtained results are promising in both the SE and DAI fields (Figure 4 and Figure 5 in Section 3.1). This encourages us to look into improved variants of ant algorithms, such as the max-min ant system [57], and to explore other aspects using the JADE and MadKit platforms in order to propose our improved Ant Algorithm. A second case study concerns the Time Tabling Problem (TTP), also solved with an Ant Algorithm. Figure 13 and Figure 14 can illustrate, respectively, the modelling of a TTP Ant System and its meta-model. In this case the environment is a graph where nodes are sessions' extremities (beginnings/ends), and arcs and their weights are durations and classes/classrooms. Consequently, the ants (teachers) follow an adapted process. Another case deals with the TTP using Grey Wolf Optimization (GWO) [43]. In this case, we simply have to replace Ant with Wolf (teacher) in Figure 13 and ACO with GWO in Figure 14 to illustrate, respectively, the modelling of a TTP Grey Wolf Optimization System and its meta-model.
7 Related Works
We can find various examples of bio-inspired multi-agent systems. Most works have a specific purpose and suffer from being designed using an ad hoc process and "methodology", or from targeting a single bio-inspired feature. As a first example, and as a dedicated agent-based methodology, [6] presents the ADELFE methodology.
ADELFE is devoted to the design of adaptive and cooperative multi-agent systems and relies on the AMAS theory ”Adaptive Multi-Agent Systems”. It seems to be a Informatica 42 (2018) 451–466 463 candidate for the handling of a class of biomorphic systems characterized by swarm intelligence. A second example is taken from the engineering of self-organization in multi-agent systems. Inspired from multi-cellular organisms, Nagpal in [45] gives a set of bio-inspired primitives engineering in robotics. In [8], author gives another example to build bio-inspired self-adapting systems; it deals with particular software systems, and presents the use of architectural styles in a software architectural perspective applied to problems with shared characteristics. It consists mainly to create a model for a given biological system. This model has to be studied until being completely understood. After that, in an iterative cycle, designers build on this initial model the target biological system. A concrete case was given for a discreet distribution problem: distributing a computation on a large network, where any small group of nodes ignore the problem they are helping to solve. We can conclude that all existing works remain specifc for particular domains and classes of problems and don’t support and encourage reuse. At variance, and with more general vision, useful guidelines to a better defnition and characteristics of biomorphic MAS were given in [48, 61] encouraging an advanced bio-inspiration which can lead to a generic process according to our topic. Another work suggests the extension of the AGR organizational model (Agent, Role and Group) [18], which gives rise to AGRE model [20]. AGRE includes the environmental dimension and crosses with our vision of the development issue of biomorphic multi-agent system. In [64, 65] authors present a general multi-agent framework called SAPERE (Self-aware Pervasive Service Ecosystems). SAPERE deals with pervasive systems seen as an ecosystem where the pervasive computing services are carried with multi-agent systems. Their contribution aims to perform the interactions between these services (MASs) with respect of bio-inspired laws summed in: Bound, Aggregate, Decay and Spread. In our case, we deal with natural systems with a multi paradigm modelling approach seen as a biotope or an ecosystem (system of interacting systems). These interacting systems implement a given bio-inspired paradigm and the interaction between them, itself, may be governed by a bio-inspired paradigm too. Our contribution aims to model these interacting systems and their interaction with multi agent systems with respect of the Infuence/Reaction Principle. So, we, both, use some common concepts and terminologies but in di.erent levels: They tackle, with a bio-inspired approach, the interaction issue between an ecosystem’s systems assumed multi agent, when we tackle, with a multi agent approach (using the Infuence/Reaction Principle to manage agent’s interaction), the modelling issue of an ecosystem’s systems and their interaction assumed, both, bio-inspired. Their work can be seen as an ideal general case study of our work, if their pervasive computing services were all bio-inspired with an influence/reaction’s interaction model. In [39], authors allow agents, in MAS technologies, to adopt dynamically an interaction’s mean among di.erent possible ones. Concretely, they used the TuCSoN (Tuple Center Spread over the Network) dedicated agent platform within the JADE and Jason platforms. 
TuCSoN use a logic-based coordination language (ReSpecT), it is a Java library to model coordination in distributed processes (such as autonomous, intelligent and mobile agents). The idea is interesting and can be used with our multi-paradigm vision to integrate di.erent bio-inspired paradigms. When the bio-inspired paradigm is hold at a micro level by agents (they must be intelligent) or in a macro level by MAS based on small number of intelligent agents, the idea is worthwhile. But, when the bio-inspired paradigm is hold at a macro level by MAS based on big number of simple (not intelligent) agent (as indicated with MAS presentation in Section 2.2 and noticed in Section 5) the idea will be less useful. 8 Conclusion To deal with the proliferation of biomorphic systems it has become necessary to focus attention and research e.orts on their modelling. Such modelling must encompass all the di.erent bio-inspired concepts. In this paper, we have advocated for a generic infuence/reaction agent-based model which integrates various bio-inspired paradigms. We consider this work as a step towards a development methodology for biomorphic MAS. Based on the fact that MAS represent a potentially unifying paradigm, a frst perspective is to establish a synthesis of agent-based methodologies and identify a kernel to adapt, in order to incorporate a meta-model based on our generic bio-inspired model. The degree of adaptation of a development approach, to this objective, depends not only on the diversity of the considered bio-inspired approaches but also their possible combinations, enriching their existing scope of applicability. In such multi-paradigm context, a second perspective would be to reconsider this kernel to exploit the power of bio-inspired approaches. Where for a given problem and knowing all its specifc criteria, we will be able to reach the state for a real guidance of the user to choose the bio-inspired paradigm to apply or indicate possible combinations. References [1] Agha, G. A., (1985). Actors: A model of concurrent computation in distributed systems, technical report, DTIC Document. [2] Amine, L. M. and Nadjet, K., (2015). A Multi-objective Binary Bat Algorithm, Proceedings of the International Conference on Intelligent Information Processing, Security and Advanced Communication, ACM, 75. D. Zeghida et al. [3] Bakhouya, M., Gaber, J. and Koukam, A., (2003). Bio-inspired model for behavior emergence: Modelling and case study, Procs. Of Knowledge Grid and Grid Intelligence workshop (KGGI’03) at IEEE WI/IAT’03. [4] Bari, A., Wazed, S., Jaekel, A. and Bandyopad­hyay, S., (2009). A genetic algorithm based approach for energy e.cient routing in two-tiered sensor networks, Ad Hoc Networks Journal, Elsevier, (7), (4), 665–676. [5] Bellifemine, F., Poggi, A. and Rimassa, G., (1999). JADE–A FIPA-compliant agent framework, Proceedings of PAAM, London, (99), 97–108. [6] Bernon, C., Gleizes, M-P., Peyruqueou, S. and Picard, G., (2002). ADELFE: a methodology for adaptive multi-agent systems engineering, International Workshop on Engineering Societies in the Agents World, Springer, 156–169. [7] Broecker, B., Caliskanelli, I., Tuyls, K., Sklar, E. and Hennes, D., (2015). Social insect-inspired multi-robot coverage, Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems, 1775–1776. [8] Brun, Y., (2008). 
Building Biologically-Inspired Self-Adapting Systems, Dagstuhl Seminar Proceedings, Schloss Dagstuhl-Leibniz-Zentrum fr Informatik. [9] Capera, D., George, J.P., Gleizes, M.P. and Glize, P., (2003). The AMAS theory for complex problem solving based on self-organizing cooperative agents, Enabling Technologies: Infrastructure for Collaborative Enterprises, WET ICE 2003. Proceedings. Twelfth IEEE International Workshops on, IEEE, 383–388. [10]Colorni, A., Dorigo, M., Maniezzo, V. and others, (1991). Distributed optimization by ant colonies, Proceedings of the frst European conference on artifcial life, (142), 134–142. [11]Cuperlier, N., Guedjou, H., de Melo, F., and Miramond, B., (2016). Attention-based smart-camera for spatial cognition, Proceedings of the 10th International Conference on Distributed Smart Camera, ACM, 121–127. https://doi.org/10.1145/2967413.2967440 [12]Da Silva, J.L.T. and Demazeau, Y., (2002). Vowels co-ordination model, Proceedings of the frst international joint conference on Autonomous agents and multiagent systems: part 3, ACM, 1129­1136. [13]Dorigo, M., Maniezzo, V. and Colorni, A., (1991). The ant system: An autocatalytic optimizing process. Tech. rep., Italy: Dipartimento di Elettronica, Politecnico di Milano. PMid:1810600 PMCid:PMC1908841 [14]Dorigo, M. and Gambardella, L., (1997). Ant colonies for the travelling salesman problem. BioSystems, Elsevier, (43), (2), 73–81. https://doi.org/10.1016/S0303-2647(97)01708-5 [15]Dorigo, M. and Di Caro, G., (1999). Ant colony optimization: a new meta-heuristic, Evolutionary Computation, CEC 99. Proceedings of the 1999 Congress on, IEEE, (2), 1470–1477. [16]Dréo, J., Pétrowski, A., Siarry, P. and Taillard, E., (2006). Metaheuristics for hard optimization: methods and case studies, Springer Science & Business Media. PMCid:PMC1569826 [17]Ferber, J., and Mller, J-P., (1996). Infuences and reaction: a model of situated multiagent systems, Proceedings of Second International Conference on Multi-Agent Systems (ICMAS-96), 72–79. [18]Ferber, J. and Gutknecht, O., (1998). A meta-model for the analysis and design of organizations in multi-agent systems, MultiAgent Systems, Proceedings. International Conference on, IEEE, 128–135. [19]Ferber, Jacques, (1999) Multi-agent systems: an introduction to distributed artifcial intelligence, Addison-Wesley Reading. PMCid:PMC1736616 [20]Ferber, J., Michel, F. and Báez, J., (2004). AGRE: Integrating environments with organizations, International Workshop on Environments for Multi-Agent Systems, Springer, 48–56. [21]Fink, G.A., Haack, J.N., McKinnon, A.D. and Fulp, E.W., (2014). Defense on the move: ant-based cyber defense, IEEE Security & Privacy, IEEE, (12), (2), 36–43. [22]Florin Leon, Marcin Paprzycki, and Maria Ganzha. (2015). A Review of Agent Platforms, Multi-Paradigm Modelling for Cyber-Physical Systems (MPM4CPS), ICT COST Action IC1404, 1–15. [23]Fogel, D.B., (1988). An evolutionary approach to the traveling salesman problem, Biological Cybernetics Journal, Springer, (60), (2), 139–144. [24]Gengan, D., Schoeman, M.A. and Van Der Poll, J.A., (2014). An Ant-based Mobile Agent Approach to Resource Discovery in Grid Computing, Proceedings of the Southern African Institute for Computer Scientist and Information Technologists Annual Conference 2014 on SAICSIT 2014 Empowered by Technology, ACM, 1. PMid:25528196 [25]Gonçalves, F. A.C.A., Guimaraes, F.G. and Souza, Marcone J.F., (2013). 
An evolutionary multi-agent system for database query optimization, Proceedings of the 15th annual conference on Genetic and evolutionary computation, ACM, 535–542. PMid:23278174 [26]Gutknecht, O. and Ferber, J., (2000). The madkit agent platform architecture, Workshop on Infrastructure for Scalable Multi-Agent Systems at the International Conference on Autonomous Agents, Springer, 48–55. [27]Hong, T-P., Huang, L-I. and Lin, W-Y., (2014). A Di.erent Perspective on Parallel Sub-Ant-Colonies, Proceedings of the 12th International Conference on Advances in Mobile Computing and Multimedia, ACM, 322–325. PMCid:PMC4346605 [28]Huget, M-P., (2014). Agent Communication, Agent-Oriented Software Engineering, Springer, 101–133. Informatica 42 (2018) 451–466 465 [29]Jennings, N.R., Sycara, K. and Wooldridge, M., (1998). A roadmap of agent research and development, Autonomous agents and multi-agent systems, Kluwer Academic Publishers, (1), (1), 7– 38. [30]Jennings, N.R., (2001). An agent-based approach for building complex software systems, Communications ACM Journal, ACM, (44), (4), 35– 41. [31]Karmani, R.K., and Shali, A. and Agha, G., (2009). Actor frameworks for the JVM platform: a comparative analysis, Proceedings of the 7th International Conference on Principles and Practice of Programming in Java, ACM, 11–20. [32]Karmani, R.K. and Agha, G., (2011). Actors, Encyclopedia of Parallel Computing, Springer, 1– 11. [33]Kosakaya, J., (2016). Multi-agent-based SCADA system, Event-based Control, Communication, and Signal Processing (EBCCSP), 2016 Second International Conference on, IEEE, 1–5. [34]Lee, U., Magistretti, E., Gerla, M., Bellavista, P., Li´o, P. and Lee, K.W., (2009). Bio-inspired multi-agent data harvesting in a proactive urban monitoring environment, Ad Hoc Networks Journal, Elsevier, (7), (4), 725–741. [35]Lin, C-T. and Lee, C.S.G., (1991). Neural-network-based fuzzy logic control and decision system, IEEE Transactions on computers, IEEE, (40), (12), 1320– 1336. [36]Lodding, K.N., (2004). The Hitchhiker’s Guide to Biomorphic Software, ACM Queue Journal, ACM, (2), (4), 66–75. [37]Ma, J., Man, K.L., Ting, T. Zhang, N., Guan, S-U. and Wong, P.WH., (2014). Accelerating Parameter Estimation for Photovoltaic Models via Parallel Particle Swarm Optimization, Computer, Consumer and Control (IS3C), International Symposium on, IEEE, 175–178. [38]Manate, B., Fortis, F. and Moore, P., (2014). Applying the Prometheus methodology for an Internet of Things architecture, Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, IEEE Computer Society, 435–442. [39]Mariani, S. and Omicini, A., (2016). Multi-paradigm Coordination for MAS: Integrating Heterogeneous Coordination Approaches in MAS Technologies, WOA, 91–99. [40]Massawe, L.V., Aghdasi, F. and Kinyua, J., (2009). The development of a multi-agent based middleware for RFID asset management system using the PASSI methodology, Information Technology: New Generations, ITNG’09. Sixth International Conference on, IEEE, 1042–1048. [41]Michel, F., (2004). Formalism, tools and methodological elements for the modeling and simulation of multi-agents systems, PhD thesis, LIRMM, Montpellier, France. [42]Michel, F., Ferber, J., Drogoul, A. and others, (2009). Multi-Agent Systems and Simulation: a Survey From the Agents Community’s Perspective, Multi-Agent Systems: Simulation and Applications Journal, 3–52. [43]Mirjalili, S., Mirjalili, S.M. and Lewis, A., (2014). 
Grey wolf optimizer, Advances in engineering software, Elsevier, (69), 46–61. [44]Mochalov, V., (2015). Multi-agent bio-inspired algorithms for wireless sensor network design, Advanced Communication Technology (ICACT), 17th International Conference on, IEEE, 33–42. [45]Nagpal, R., (2003). A catalog of biologically-inspired primitives for engineering self-organization, International Workshop on Engineering Self-Organising Applications, Springer, 53–62. [46]Olaifa, M., Mapayi, T. and Van Der Merwe, R., (2015). Multi Ant LA: An adaptive multi agent resource discovery for peer to peer grid systems, Science and Information Conference (SAI), IEEE, 447–451. [47]Padmanaban, R., Thirumaran, M., Suganya, K. and Priya, R.V., (2016). AOSE Methodologies and Comparison of Object Oriented and Agent Oriented Software Testing, Proceedings of the International Conference on Informatics and Analytics, ACM, 119. https://doi.org/10.1145/2980258.2982111 [48]Parunak, H.V.D., (1997). ”Go to the ant”: Engineering principles from natural multi-agent systems, Annals of Operations Research Journal, JC BALTZER AG, (75), 69–102. [49]Perez-Carabaza, S., Besada-Portas, E., Lopez-Orozco, J.A. and de la Cruz, J.M., (2016). A Real World Multi-UAV Evolutionary Planner for Minimum Time Target Detection, Proceedings of the 2016 on Genetic and Evolutionary Computation Conference, ACM, 981–988. https://doi.org/10.1145/2908812.2908876 [50]Perez-Diaz, F., Zillmer, R. and Groß, R., (2015). Firefy-Inspired Synchronization in Swarms of Mobile Agents, Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems, 279–286. [51]Qian, B. and Cheng, H.H., (2016). A mobile agent-based coalition formation system for multi-robot systems, Mechatronic and Embedded Systems and Applications (MESA), 2016 12th IEEE/ASME International Conference on, IEEE, 1–6. [52]Rehberger, S., Spreiter, L. and Vogel-Heuser, B., (2016). An agent approach to fexible automated production systems based on discrete and continuous reasoning, Automation Science and Engineering (CASE), IEEE International Conference on, IEEE, 1249–1256. [53]Rowley, H.A., Baluja, S. and Kanade, T., (1998). Neural network-based face detection, Pattern Analysis and Machine Intelligence Journal, IEEE Transactions on, IEEE, (20), (1), 23–38. [54]Silva, D.C., Braga, R.A-M., Reis, L.P. and Oliveira, E., (2010). A generic model for a robotic agent system using GAIA methodology: Two distinct D. Zeghida et al. implementations, Robotics Automation and Mechatronics (RAM), IEEE Conference on, IEEE, 280–285. [55]Sturm, A. and Shehory, O., (2014). Agent-oriented software engineering: revisiting the state of the art, Agent-Oriented Software Engineering, Springer, 13– 26. [56]Sturm, A. and Shehory, O., (2014). The landscape of agent-oriented methodologies, Agent-Oriented Software Engineering, Springer, 137–154. [57]Sttzle, T. and Hoos, H., (1998). Improvements on the ant-system: Introducing the max-min ant system. Artifcial Neural Nets and Genetic Algorithms, Springer, 245–249. https://doi.org/10.1007/978-3-7091-6492-1_54 [58]Tarakanov, A., (2001). Information security with formal immune networks, Information Assurance in Computer Networks Journal, LNCS, Springer, (2052), 115–126. [59]Tsang, C.H. and Kwong, S., (2005). Multi-agent intrusion detection system in industrial network using ant colony clustering approach and unsupervised feature extraction, Industrial Technology, ICIT 2005. 
IEEE International Conference on, IEEE, 51-56. [60]Wang, J., Cao, J., Li, B., Lee, S. and Sherratt, R. S., (2015). Bio-inspired ant colony optimization based clustering algorithm with mobile sinks for applications in consumer home automation networks, IEEE Transactions on Consumer Electronics, IEEE, (61), (4), 438–444. [61]Weyns, D., Parunak, H.V.D. and Michel, F., (2006). Environments for Multi-Agent Systems II, Second International Workshop, E4MAS 2005, Utrecht, The Netherlands, July 25, 2005, Selected Revised and Invited Papers, Springer, (3830). [62]Wooldridge, M. and Jennings, N.R., (1994). Agent theories, architectures, and languages: a survey, International Workshop on Agent Theories, Architectures, and Languages, Springer, 1–39. [63]Xiang, W. and Lee, HP., (2008). Ant colony intelligence in multi-agent dynamic manufacturing scheduling, Engineering Applications of Artifcial Intelligence Journal, Elsevier, (21), (1), 73–85. [64]Zambonelli, F., (2015). Engineering Environment-Mediated Coordination via Nature-Inspired Laws, Agent Environments for Multi-Agent Systems IV, Springer, 63–75. [65]Zambonelli, F., Omicini, A., Anzengruber, B. and others, (2015). Developing pervasive multi-agent systems with nature-inspired coordination, Pervasive and Mobile Computing, Elsevier, (17), 236–252. [66]Zeghida, D., (2003). MALCS: Multi-Agent Learning Companions System, DEA thesis, Badji Mokhtar Annaba University, Algeria. [67]Zeghida, D., Meslati, D., Bounour, N. and Allat Y., (2018). Agent Infuence/Reaction Ant System Variants: An Experimental Comparison, International Journal of Artifcial Intelligence, (16), (2), 60-77. EID: 2-s2.0-85053662680 https://doi.org/10.31449/inf.v42i3.2424 Informatica 42 (2018) 467–475 467 Empirical Study on the Optimization Strategy of Subject Metro Design Based on Virtual Reality Zhendong Wu College of Mechanical Engineering and Automation, Huaqiao University, Xiamen, Fujian 361021, China E-mail: zdongwuhqu@yeah.net Technical Paper Keywords: virtual reality eye movement, visual attention, theme subway, design strategy Received: August 23, 2018 A three-dimensional simulation interactive virtual scene was established taking the theme subway in Chengdu and Guangzhou as the typical case, and the standard metro in Xiamen as the reference. An experiment was designed using virtual reality built-in eye movement equipment and following the principle of visual attentiveness. The conscious and unconscious visual behaviors of users were analyzed and the impacts of different design methods on user experience and behaviors were analyzed. This study extracted the key elements of the theme subway design and recombine them to compared the design of facilities at the same position but in different themes and the design of space interface at different positions but in the same theme. Moreover, optimization strategies were put forward for the design of theme subway space to enhance the availability of the design. Povzetek: Predstavljena je virtualna študija za učinkovito predstavitev podzemnih kitajskih železnic uporabnikom. The status of internal design of theme subway According to the statistics of China's rail transit network, 24 of 30 cities which have subway have opened theme subway, and 69 themes are included [i]. Theme design refers to a design method of setting up a series of scenes or events in the form of visual creativity by means of narrative techniques to establish a transfer link with audience. 
Theme subway which applies theme design is taken as the material carrier of social and cultural information, which solves the problem of characteristic crisis of space [ii], makes people perceive the cultural atmosphere and connotation of a city or region, and enhance the identifiability of a city. The current design of theme subway in China focuses on the publicity of theme, still in the initial stage. Design ideas of the internal space of subway is monotonous. The design ideas include texture design and design of three-dimensional modeling. Texture design means directly applying theme related picture materials on the wall surface of carriages. Design of three-dimensional modeling means transforming three-dimensional cultural model or plane cultural elements to three-dimensional models, taking them as facilities or decoration carriers, and endowing them with practical or decorative functions. Wang Dawei from Shanghai Academy of Fine Arts proposed the theoretical model of space design of subway station which is a cube model composing of professional emphasis, design elements and subway space. The design should consider not only visual aesthetics but also the security, comfortability, economical efficiency and sustainability of metro space design [iii]. The design of theme subway needs to reduce the psychological pressure of passengers in the claustrophobic space from the psychological point of view and improve the visual comfort of passengers in the subway environment. As to the content of theme culture, the design should guide passengers to form short-term memory and information feedback. The current theme subway is mainly based on experience design; the good and bad efficacy are intermingled because of the lack of scientific basis and experimental verification. 2 Reflection of design of theme subway based on the principle of visual attention James (1890) first proposed that the directivity and centrality of perception are the two basic characteristics of attention and only a few objects will be noticed at the moment when objects are sensed [ iv ]. It shows that attention has the role of screening. In the process of visual scene observation, the user's visual presentation is progressive and incomplete [v].When attention is focused on the perceived area, consciousness can capture the stimulation of interest in that area. Von Helmholtz (1925) and James (1890) put forward "where"and "what" [vi]."Where" is a view put forward by Vol Helmholtz. He focused on the relationship between eye movement and spatial position. It means that involuntary attention is a fixation behavior which is based on individual experience or task objectives and controlled by consciousness or autonomous behaviors. "What" is a view put forward by James. He thought that attention is a mechanism with hidden inner. It is active and voluntary and relates to properties, significance and expectation of attention focus, involving information processing, refers to voluntary fixation behavior under an unconscious state. Short-term memory and even long-term memory will form when people use the attention of high-level cognitive ability. Attention is an important psychological adjustment mechanism in the process of visual information processing, and it is not only related to individual cognition, but also influenced by emotional mechanism [vii]. 
In Emotional Design, Donald Norman proposed three levels of brain information processing, in which the instinctive level is the subconscious judgment determined by biological heredity (subconscious judgment of individual behavior), and the behavior layer is a habit based reaction (based on the individual experience). They correspond to high level of attention (involuntary attention) and low level of attention (voluntary attention) respectively. According to the above principles, when passengers enter a carriage, observation of the instinctive layer is firstly induced, and the attention is active and voluntary at that moment. If a certain area in a scene arises passengers' interest, the area will be perceived by the vision around the central fovea, and then more detailed content is perceived. The visual fixation area and time data of passengers in this state are obtained through first fixation time among eye movement indexes. First fixation time refers to early identification process of area of interest and sensitivity to processing difficulty of area of interest. Shorter time indicates that the region is easier to be concerned by users [ viii ]. The visual design features of the interest area inside the carriage were summarized. In addition, the involuntary attention of the behavior level is controlled by consciousness and autonomous behaviors. The scope of attention is very small; hence passengers perceive all the parts of the scene through continuous scanning. Whether the design of carriage induces the short-term memory of passengers can be analyzed based on the division of area of interest of heat map and retention time. Area of interest (AOI): It is usually used for design availability analysis. Some element of interface is isolated as a specific area or content for further analysis [ ix ]. Heat map refers to information visualization graph which presents eye movement data in the form of cloud picture [x]. It can intuitively present the area of interest of the subject and analyze the focus area and retention time. Retention time: retention time is a very good index for testing the degree of interest for a specified area of interest. Longer retention time means more interests of users on an area of interest [xi]. Z. Wu 3 Necessity of virtual reality eye movement linkage experiment In the past, the experimental research on internal design of rail transit was mainly based on rendering pictures and portable eye movement equipment. The maturity of the technology which combines head mounted virtual equipment with eye movement instrument in 2016 provides technical support for the accuracy of data and control of independent variables of such kind of study [ xii ]. Compared with the previous experiments, the advantages of the experiments which apply the new technology are mainly reflected in the following three aspects. (1) Strong immersion in visual scene and more objective data The visual scene is a three-dimensional simulation model with high preciseness, which restores the internal facilities, lighting, dimension sense of space and texture of the real scene; therefore subjects can obtain real experience in the virtual environment [xiii]. For example, in the course of the experiment, subjects try to grasp a handrail after entering the scene, which shows that the scene is very vivid and more objective eye movement data can be obtained. In the previous experiments, two-dimensional pictures were usually used as stimuli, and subjects cannot feel immersed, which affected the objectivity of data. 
(2) Extraction of implicit data and recording while looking The experiment of portable head eye tracker combined with pictures is very difficult for users to concentrate due to the large error of eye movement data and the small size of stimulon. The virtual reality scene can be observed in 360 degrees, and the subjects' perspective is large, which can not be constrained by the size of picture. It can collect the eye tracking data consciously and unconsciously (implicit) in the virtual scene in real time to perfect categories of data [xiv]. (3) Effective control of independent variables is beneficial to comparison Design factors which can affect user experience include content of theme, shape design, area of pattern and position of decoration. Changes of variables in virtual eye movement experiments will generate different scenes; in this way, scene changes can be realized under no disturbance. It is beneficial for comparing reactive states of user experience and analyzing the relationship between variables and design. For example, visual attention of subject will transfer when the color saturation inside subway carriage is too high. 4 Experimental design Taking the theme subway in Xiamen, Chengdu and Guangzhou as the research subjects, this study established a three-dimensional simulation interactive virtual scene. The eye movement of users in the scene was recorded. The influence of different designs on user experience was compared and analyzed based on the Figure 1: Research ideas. Figure 2: Experimental samples. principle of visual attention. Based on it, design strategies for subway space were summarized. The technology combing virtual reality with eye movement which was independently developed by Shanghai Qingtech Co., Ltd., China. An eye tracking module was inserted into HTC vive to track and record the real-time eye movement data in the virtual visual scene. Research ideas are shown in Figure 1. 4.1 Survey of design cases and selection of samples Sixty-nine design cases of theme subway in 24 cities in China were collected, and two of them was selected as the representative experimental samples. Panda theme subway on Line 3 in Chengdu and cartoon theme subway in Guangzhou were selected as comparison samples, and the standard subway on Line 1 in Xiamen was selected as the reference of this study, as shown in Figure 2. In the panda theme subway on Line 3 in Chengdu, panda which is a regional cultural characteristic of Chengdu was taken as the design element, and the image of panda was integrated into the appearance design of seat, handle and side walls to create three-dimensional models. The standard subway on Line 1 in Xiamen has no theme, which is the mode of most standard subways in China. Analysis on recombination of design elements. As the design of theme subways involves many factors, pattern area,modeling technique and decoration position were selected as the key elements for comprehensive analysis. The proportion of pattern area refers to the proportion of the pattern area inside a carriage to the total area, and it has three grades, 10% ~ 30%, 30% ~ 60% and 60% ~ 100%. Modeling techniques include design of three-dimensional modeling and design of texture. Design of three-dimensional modeling mainly focuses on positions of handles, handrailings and seats, while design of three-dimensional modeling focuses on side walls, end walls and top surface. Design elements were classified and then recombined. Two design issues were analyzed. 
The first issue was the comparison of subway facility design in different themes but at the same position, and the second issue was the comparison of visual perception of different spatial interface design in the same theme but at different positions. Based on it, design strategies of hardware facilities and interface which was more in line with the principle of visual attention could be put forward. Details are shown in Figure 3. 4.2 Subjects In this experiment, there were 30 subjects, aged 18 ~ 35 years. In order to ensure that all the subjects had the same cognitive level, all of them had no virtual reality experience, but had the experience of taking the subway. The uncorrected or corrected visual acuity of the subjects were normal, and neither of them had color blindness. At the beginning of the experiment, the subjects were asked to receive an eye movement calibration test which lasted for 30 ~ 60 s in the scene. The formal test started after the eye movement calibration; they had a visual activity Figure 3: Two design issues corresponding to the recombination of design elements. Figure 4: The division of area of interest in three subway carriages. of random observation in the virtual scene which lasted for 120 s. 4.3 Analysis of experimental results Firstly, the area of interest inside the subway was divided. As shown in Figure 4, the subway carriage was divided into six regions of interest according to the space region and functional facilities: top surface, side walls, end walls, ground, handles and seats. The first fixation duration and retention time in the six regions of interest were recorded. The mean values and variances of the eye movement data were statistically analyzed using SPSS to evaluate the internal design of the carriage. 4.3.1 The comparison of facility design in the same theme but at different positions User will unconsciously observe firstly when he enters a [Pritegnite pozornost bralca z odličnim citatom iz dokumenta ali pa izkoristite ta prostor, da poudarite ključno točko. Če želite premakniti to polje z besedilom na katero koli drugo mesto na strani, ga preprosto povlecite.] carriage for the first time. The impact of different subway facility designs on the attention of users was analyzed. According to the analysis of the heat map of the standard subway, it was found that the seat and handrail facilities of the subway were the concerns of passengers, as shown in Figure 5. Therefore, different designs of positions of handrails and seats in the subway carriage was compared. The fixation condition of users in an unconscious state was analyzed by performing descriptive analysis on the first fixation time of users, as shown in Table 1. Different design methods for the same location and different themes have different effects on eye movement data. The three-dimensional design of the seats and handles which took panda as the element in the Chengdu theme subway attracted the most interests and attentions from users, as shown in Figure 6. The minimum value was 43.83, and the sensitivity was high in the early recognition process. Next was the standard subway, the Table 1: The descriptive analysis results of the first fixation in the facility design. Descriptive statistics of the first fixation time in the facility design Experiemnt Mean Standard Facility Remark Remark al samples value deviation Seats Standard subway 70.27 Smaller mean value means users are more likely to pay attention to the facility. 
73.21 57.13 81.76 69.97 50.79 42.83 Smaller value of standard deviation means smaller difference of experience tendency. A subway in Chengdu 43.83 A subway in Guangzhou 82.00 Standard subway 65.49 Handles A subway in Chengdu 51.55 A subway in Guangzhou 68.24 Remark: the unit of the first fixation time is second. Figure 6: Design of facilities. handles and seats were red with high saturation degree, which was in sharp contrast with the surrounding environment. Its value was larger than the Chengdu subway (70.27 > 43.83). Therefore, the three-dimensional design was better than the high saturation color design. The handles and seats of the theme subway in Guangzhou were gray and unified with the surrounding environment. Its value was the highest (82.00 > 70.27 > 43.83) and had a large gap with the eye movement data of the other subways. Therefore, model color design which was close to the environmental color had the least attractiveness and the lowest sensitivity to the early recognition reaction. Through the above analysis, it was concluded that there were two design methods of theme subway facility. The first one was design of color, and the second one was design of three-dimensional modeling. Both had advantages and disadvantages. The design strategies are shown in Table 2. In China, the Disney theme subway in Hongkong is a combination of three-dimensional modeling and color design, which conveys the theme of Mickey always accompanies with passengers. In the design of windows Table 2: The design strategy for hardware facilities in theme subway. Design method of facility Design of color Design of three-dimensional modeling High economical 1. The application of thematic image has efficiency: low strong visual attraction. construction cost 2. High identifiability and highly sensitive to Advantages the early recognition reaction. The application of personification design and skeuomorph makes facilities more visually hierarchical. 1. Weak visual attraction 1. Complex manufacturing technique 2. Low sensitivity to (customized design) reaction and weak 2. Higher cost compared to the design of Disadvantages identifiability standard subway 3. Similar design style 3. Individualized design for every theme, lacking of sustainability. Three-dimensional modeling + color design 1. Abstract design of thematic images (serialization design of handle, support Conclusions rod and connecting rod) for optimizing 2. Serialization design of handles and seats to strengthen content of theme strategy 3. Pay attention to the complementarity of facility color and surrounding environment in subway carriage. Table 3: The sorting of average retention time of different space interface. The sorting of average retention time of different space interfaces 12345 6 Guang zhou Top surface 6.33 End wall 4.80 Groun d 4.33 Side wall 3.47 Seat 1.79 Handrai l 1.46 Cheng du Side wall 3.81 Groun d 1.87 Seat 1.87 Top surface 1.56 End wall 1.35 Handrai l 1.28 Standa rd Side wall 2.84 Handra il 1.1.8 Groun d 1.11 Top surface 0.91 Seat 0.89 End wall 0.77 Remark: the unit of the retention time is second. and handles, the three-dimensional design of "Mickey head" is adopted. The three-piece handle attracts attentions of passengers by contrast colors, red, yellow and black. The arrangement of the seats breaks out the previous end-to-end arrangement. The L-shaped blue corner cloth sofa contrasts vividly with the yellow on the surrounding supports in the whole space. 
The echoing of color stimulus and theme modeling also leave a deep impression on people, as shown in Figure 7 [xv]. Figure 7: The design of internal facilities of the Disney theme subway. The space interface of subway carriage can be divided into top surface, ground, side walls and end walls; passengers pay more attentions to these four parts. Therefore, dual requirements of functional technology and the aesthetic level of space need to be satisfied. Taking the theme subway in Guangzhou and Chengdu as examples, the influence of wall design on visual retention and attention of users was discussed to conclude the design features of different locations inside the subway [xvi]. 4.3.2 The comparison of facility design in the same theme but at different positions First of all, variance analysis of retention time of eyes was made. The significance of position * theme Sig=0.027<0.05 indicated a significant difference; it meant that the decoration position was interactive with theme design. Eye retention time of the decoration Figure 9: Functional information signs and the distribution positions. position varied with the theme. Significance Sig=0<0.05 indicated a statistically significant difference in data of different locations. However, the variance analysis of the first fixation time found that significant of position * theme Sig=0.762>0.05 indicated no statistically significant difference, showing that the user was in the unconscious state and the decoration position was non-interactive with the theme design. For further analysis, data were processed by descriptive statistics, and the eye movement data at different locations on the same theme were sorted preferentially, as shown in Table 3. Through descriptive analysis, it was found that users had different degrees of information processing at different locations after entering the carriage. The comparison of the top three positions suggested there was a commonality although users had different fixation points. In all regions of interest, ground and side walls in all the carriages were observed, which conformed to the behavioral mode of people in subway; they were also the keys in the design. The side walls of Chengdu theme subway is designed based on panda and labeled with text information, as shown in Figure 8. The first fixation time of the side walls was 60.13 s, indicating that it was paid less attention to compared to the Guangzhou subway. But after a long-time observation, the fixation time of the side walls was 3.81 s; with a high readability, it could attract more attentions and interests. Therefore, it could be concluded that the side wall is an important position which users will pay attention to for a long time. Situational decorative design was not suitable for side walls because of the region segmentation and functional information, as shown in Figure 9. These signs can provide information services such as instructions, hints and warnings to passengers through visual communication, which plays a key role in the safety of passengers in the subway station. Without affecting the search of functional information, small texture design or three-dimensional modeling design can be used. Figure 8: Text information on the side walls of the Chengdu subway. The Guangzhou theme subway focused on texture design, and users paid more attentions to the top surface (6.33 s), the end wall (4.80 s) and the ground (4.33 s) which had complete content. 
Due to the pattern integrity and sufficient scene presentation of the top surface, users observed it for a long time and paid the most attentions to it. Therefore top surface was the best place to display the situational theme design; while maintaining the spatial integrity, it would not interfere with the search of the functional information inside the carriage. Currently, there are few theme subways with designed top surface, and it is also easily to be ignored by designers. For the design of ground, small area or monotonous color design can be used as it has certain behavioral functions and easy to wear because of the large staff mobility. In the process of experiment, the visual attention reaction of users was tested by changing the saturation of the side walls of the theme subway in Guangzhou. When the saturation was too high or low, the user's attention was quickly transferred to other areas. Therefore, it was concluded that color saturation was an important factor affecting the visual attention of users. It was suggested that designers use moderate saturation for the overall space of carriage, and the area that needs to be noticed by the user can be used in the contrast of high saturation color. The optimization strategy is shown in Table 4. Table 4: The summary of optimization strategies for the interface design of theme subway. Position Characteristics of positions Conclusions of optimization strategy Side walls 1. Region division and many functional information 2. Small designable area 3. An area which is focused on 4. High information readability Consideration for safety: 1. Design of small-area texture or design of three-dimensional modeling is allowed on the premise of not affect searching of functional messages. 2. Not suitable for large area of situational pattern design 3. Suitable for reading of text messages in theme design 4. Pattern design around guiding messages in the area of side door is not suitable as it will increase time of searching messages. 1. Large designable area Top 2. No region division on the wall surface surface, with a high integrity 3. Presenting visual height Consideration for comfort 1. Present situational theme design with design of texture 2. Not suitable for layout of text information 3. Passengers may feel reduced visual height and feel depressive because of complex pattern 4. The saturability, relative brightness and hue of patterns has large influence in improving the visual height of space (visual perception stratification). Consideration for function 1. Large designable area 1. Design of small-area texture or monotonous color 2. No region division on the wall design are feasible. Ground surface, with a high integrity 2. Large-area texture design is not suitable as the 3. With a behavioral function large passenger flow in the carriage is prone to cause 4. Easy to wear wearing. 5 Conclusion Several design strategies of color, shape and texture were developed based on virtual reality technology, reference to different theme subway space, eye movement data and subjective evaluation for the design of hardware facilities and interface in theme subway, which can provide a reference for future design. In future research, virtual reality technology in combination with eye movement technology can be used for the study of the spatial availability of the environment to offer users a better experience in the space. 
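As a concrete illustration of the statistical analysis described in Section 4.3.2 (descriptive statistics of retention times per area of interest, followed by a two-way variance analysis with a position x theme interaction), the following is a minimal sketch of an equivalent workflow in Python. The original analysis was performed in SPSS; the file name and the column names (subject, theme, position, retention_time) used here are hypothetical and would need to match the exported eye-tracking data.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format eye-movement data: one row per subject x AOI,
# with the carriage theme, the AOI position and the retention time in seconds.
df = pd.read_csv("eye_movement_aoi.csv")  # columns: subject, theme, position, retention_time

# Descriptive statistics of retention time per theme and position,
# analogous to the SPSS descriptive analysis summarized in Table 3.
print(df.groupby(["theme", "position"])["retention_time"].agg(["mean", "std"]))

# Two-way ANOVA with the position * theme interaction, analogous to the
# significance test reported in Section 4.3.2 (e.g. Sig = 0.027 for position * theme).
model = ols("retention_time ~ C(position) * C(theme)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

The same script, with retention_time replaced by the first fixation time column, would reproduce the second test discussed in Section 4.3.2.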
6 Acknowledgement This study was supported by the 2017 Youth Fund of Humanities and Social Sciences Research of Ministry of Education under grant number of 17YJC760100. 7 References [i] Zhao JP (2016). Summary of Chinese theme subway. Time Report, (40). [ii] Wu LW (1989). Generalized Architecture. Tsinghua University Press. [iii] Guo XY, Wang ZS (2014). Metro Station Space Environment Design: method, procedures and example. China Water Power Press. [iv] James W (1890). The principles of psychology, Vol. 2. NY, US: Henry Holt and Company. https://doi.org/10.1037/10538-000 [v] Duchowski AT (2015). Eye Tracking Methodology Theory and Parctice. Zhao XB, Ziu XC, Zhou YJ translated. Beijing: Science Press, pp. 3-4. https://doi.org/10.1007/978-1-84628-609-4 [vi] Helmholtz H, Southall JPC (1925). Treatise on physiological optics. III. The perceptions of vision. [vii] Liang Y, Liu HZ (2010). Study of Image Retrieval Based on Vision Attention Mechanism. Journal of Beijing Union University, (1), pp. 30-35. [viii] Duchowski AT (2003). Eye Tracking Methodology: Theory and Practice. Springer London. https://doi.org/10.1007/978-1-84628-609-4 [ix] Eckhoff ND, Ervin PF (1971). Area-of-interest unfolding. Nuclear Instruments & Methods, 97(2), pp. 263-266. https://doi.org/10.1016/0029-554X(71)90280-1 [x] Hashimoto Y, Matsushita R (2012). Heat Map Scope Technique for Stacked Time-series Data Visualization. International Conference on Information Visualisation, IEEE, Montpellier, France, pp. 270-273. http://doi.ieeecomputersociety.org/10.1109/IV.2012 .53 [xi] Duan CH, Yan ZQ, Wang FX (2012). Eye Movement Research on the Influence of Animation Rendering Speed on Multimedia Learning Effect. National Psychological Academic Conference, pp. 46-53. [xii] Farooq U, Glauert J, Zia K (2017). Load Balancing for Virtual Worlds by Splitting and Merging Spatial Regions. Informatica, 42(1). [xiii] Liu XX (2016). Research on the Influencing Factors of Visual Cognition of “Dynamic Illumination”— —Taking the Nightscape Lighting of “Tokyo Skytree” in Japan as an Example. Decoration, (3), pp. 101-103. [xiv] Nini B (2013). Bit-projection based color image encryption using a virtual rotated view. Informatica, pp. 283-291. [xv] Zhao ZC, Zhao J, Zhou C (2015). Saving Design Practice and Exploration: The Conceptual Design of Public Space about Nanjing Metro Line 2. Art & Design, (8), pp. 87-89. [xvi] Zhong X (2017). The study of city public traffic visual identification system of regional characteristics-a case study of Chengdu. Southwest Jiaotong University. https://doi.org/10.31449/inf.v42i3.2454 Informatica 42 (2018) 477–482 477 Defect Features Recognition in 3D Industrial CT Images Haina Jiang Chongqing College of Electronic Engineering, Chongqing City, 401331, China Corr. address: No. 70-2-4-3, Huxi Garden, University Town, Shapingba District, Chongqing City, 401331, China E-mail: hainjiangcq@126.com Technical Paper Keywords: industrial CT image, defect location, image segmentation, feature extraction, feature recognition Received: September 15, 2018 Due to the limitations of production conditions, there is a certain probability that workpiece product has internal defects, which will have a certain impact on the performance of workpiece. Therefore, the internal defects detection of workpiece is essential. This study proposed a defect recognition method based on industrial computed tomography (CT) image to identify the internal defects of workpiece. 
The block fractal algorithm was used to locate the defective parts of the image, the improved k-means clustering algorithm was then used to segment the defective parts, and the feature vector was extracted with Hu invariant moments. Finally, the firefly algorithm and a radial basis function (RBF) neural network were combined to identify the defects. The experiments showed that the algorithm in this study reached an accuracy of 97.89%, which proved its reliability and provides some suggestions for defect recognition.

Povzetek: Za prepoznavanje okvar na 3D slikah industrijskih izdelkov je uporabljena metoda vinske mušice.

1 Introduction

Defect detection plays a very important role in the industrial field. Through defect detection, product quality can be effectively improved. Alimohamadi et al. [1] proposed a new defect detection method based on optimal Gabor wavelet filters combined with morphological analysis. The experimental results on different types of textiles showed that this algorithm was robust for defect detection in various kinds of textile. Chen et al. [2] proposed a new defect detection method based on the dual-tree complex wavelet transform (DT-CWT) and took advantage of the near shift-invariance of the DT-CWT to extract weak defect features. Experimental results demonstrated the validity of the proposed method. Sabeenian et al. [3] presented a defect detection algorithm making use of a multi-resolution combined statistical and spatial frequency method. The accuracy obtained by the MATLAB simulation was 99%, which proved the practicality of the algorithm. Liu et al. [4] optimized the subtractive clustering method (SCM) with the Akaike information criterion (AIC) and then constructed a radial basis function (RBF) model using the obtained AIC-SCM algorithm, which improved the adaptability of the RBF model. Experimental results showed that this method could identify defects with high accuracy. Leng et al. [5] used a convolutional neural network in the detection and classification of galvanized stamping parts and obtained a precision of 99.6%. Industrial computed tomography (CT) imaging is a simple and efficient method [6, 7] for the internal defect detection of workpieces, which can effectively detect internal defects and significantly reduce the detection cost [8]. Samarawickrama et al. [9] performed defect detection of tiles based on industrial CT images and found it more accurate and efficient than the manual method. In this study, based on industrial CT images, the defect was obtained through the localization and segmentation of the defect image, feature extraction was then conducted using Hu invariant moments, and finally an RBF neural network optimized with the firefly algorithm was used to recognize the defects, in order to explore the reliability of this method for defect recognition.

2 Internal defect detection of workpiece

Due to the production technology, production conditions and other aspects, workpiece products often have a certain probability of internal defects. These defects not only affect the performance of the workpiece but also pose certain safety risks during actual use. The detection of defects is therefore an important part of industrial production. Industrial CT imaging is an effective method for workpiece defect detection. With the development of the technology, the performance of industrial CT imaging is gradually improving and its cost is decreasing. It has been widely used in aerospace, military, electronics, petroleum and other fields. Industrial CT images can easily be stored and analyzed, and they enable quick and accurate detection of the presence or absence of defects in a workpiece, as well as evaluation of the size and location of defects [10-12]. The technique has high resolution and adaptability, so images of different gray levels can be effectively examined. At present, the detection of defects in CT images is mostly carried out manually, with low accuracy; an intelligent identification method can effectively improve efficiency and reduce errors, and this is the development direction of defect detection methods. The process of defect detection based on industrial CT images proposed in this study is shown in Figure 1.

Figure 1: Flow chart of industrial CT image defect recognition.
3 Defect detection algorithm

3.1 Defect localization algorithm based on block fractal

Automatic defect localization was performed using fractal theory [13]. Fractal theory was put forward by Mandelbrot and has been extensively applied in graphics and geography. The fractal dimension [14] is obtained with the Blanket algorithm proposed by Peleg; the blocks containing defects are then marked and the defect is localized.

3.1.1 Blanket algorithm

The gray function of the image is regarded as a surface, and a blanket is imagined covering this gray surface from above and below. At scale zero, both the upper and the lower blanket surface coincide with the gray surface itself. At each larger scale, the upper surface is obtained by raising the previous upper surface by one gray level and taking, at every pixel, the maximum over its neighbourhood; the lower surface is obtained symmetrically by lowering the previous lower surface by one gray level and taking the minimum over the neighbourhood. From these two surfaces the volume enclosed between the blankets is calculated, and from it the fractal area A(e) at scale e is obtained. The fractal area and the fractal dimension W obey a power-law relation, so the logarithm of A(e) is a linear function of the logarithm of the scale e with slope 2 - W. Fitting a straight line to log A(e) against log e and measuring its slope k therefore yields the fractal dimension as W = 2 - k.

3.1.2 Block fractal algorithm

(1) The image is divided into rectangular regions of the same size.
(2) The fractal areas corresponding to different scales are calculated for each region with the Blanket algorithm.
(3) The fractal dimension W of each region is calculated from the fitted slope described above.
(4) A fractal threshold K is set. If the fractal dimension of a block is greater than this threshold, the block contains an edge, and the block is marked in white.
(5) It is then determined whether the marked blocks belong to workpiece edges or to defect areas. Since a defect area contains fewer connected blocks than a workpiece edge, a connected group of marked blocks whose size is greater than the connectivity threshold T is taken to be a workpiece edge.
(6) The marked blocks that remain after removing the workpiece edges are recorded as the defect areas of the workpiece.
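As a point of reference, the Blanket computation of Section 3.1 can be sketched as follows. The paper gives no implementation, so this is a minimal Python/NumPy/SciPy sketch assuming Peleg's standard formulation of the algorithm; the cross-shaped neighbourhood, the maximum scale, the area estimate A(e) = V(e)/(2e) and the helper names are assumptions, not details taken from the paper.

    import numpy as np
    from scipy.ndimage import grey_dilation, grey_erosion

    CROSS = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # 4-neighbourhood

    def blanket_fractal_dimension(block, max_scale=8):
        # Upper and lower blankets start on the gray surface itself (scale 0).
        u = block.astype(np.float64)
        b = block.astype(np.float64)
        log_scales, log_areas = [], []
        for eps in range(1, max_scale + 1):
            # Raise (lower) the blanket by one gray level and take the
            # neighbourhood maximum (minimum), as in the Blanket algorithm.
            u = np.maximum(u + 1, grey_dilation(u, footprint=CROSS))
            b = np.minimum(b - 1, grey_erosion(b, footprint=CROSS))
            # Volume between the blankets and the derived area at this scale.
            volume = np.sum(u - b)
            log_scales.append(np.log(eps))
            log_areas.append(np.log(volume / (2.0 * eps)))
        # log A(eps) = (2 - W) log eps + const, so W = 2 - slope.
        slope, _ = np.polyfit(log_scales, log_areas, 1)
        return 2.0 - slope

    def mark_defect_blocks(image, block_size, threshold_K):
        # Block fractal step: split the image into equal blocks and mark those
        # whose fractal dimension exceeds the threshold K.
        h, w = image.shape
        marked = np.zeros_like(image, dtype=bool)
        for r in range(0, h, block_size):
            for c in range(0, w, block_size):
                blk = image[r:r + block_size, c:c + block_size]
                if blanket_fractal_dimension(blk) > threshold_K:
                    marked[r:r + block_size, c:c + block_size] = True
        return marked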
3.2 Image segmentation algorithm based on improved k-means clustering

The traditional k-means clustering algorithm depends on randomly chosen initial clustering centers, which may reduce its reliability. To make up for this deficiency, the automatic initial clustering center generation algorithm [15] was used to improve the traditional algorithm. Suppose that T is the coordinate set of the image data set D, that each point of D carries a gray value, that there are K classes with K corresponding clustering centers, and that i denotes the iteration number. The algorithm steps are as follows:
(1) Determine the number of clusters K and the clustering accuracy threshold.
(2) Generate the initial clustering centers with the automatic initial clustering center generation algorithm.
(3) Take each initial clustering center as the single member of its initial cluster.
(4) In the i-th iteration, assign every data point to the cluster whose center is nearest according to the distance function, i.e., the distance between the point and the clustering center in the i-th iteration.
(5) Reset the clustering centers and cluster again: the new center of each cluster is the mean of the feature points assigned to it in the i-th iteration.
(6) Repeat (4) and (5) until the clustering centers remain unchanged or the change is smaller than the accuracy threshold.
(7) Output the image segmentation result according to the clustering result.
The number of clusters K is set to the number of peaks of the gray histogram of the reference image, or to the number of object types in the reference image. The distance function measures the difference between the gray value of a point and that of a clustering center.

3.3 Feature extraction algorithm based on Hu invariant moments

The three common defects in workpieces are stomata (pores), cracks and slag inclusions. The shape and gray information of the three defect types differ considerably, so this information can be extracted for identification.
(1) Shape features. The length-width ratio of the defect part is R/K, where R stands for the long axis and K for the short axis. The circularity of the defect is L^2/A, where L^2 stands for the square of the perimeter and A for the area.
(2) Gray information. The gray mean (Mean) and the gray variance (v) of the defect region are computed from h(x, y), the gray value of the defect pixel points, over the n pixels of the region.
Seven invariant moments can then be obtained according to Hu invariant moment theory [16], and ten abstracted feature values R1-R10 are derived from them (Table 1).

Table 1: Feature values (R1-R10) obtained from the abstracted Hu invariant moments.

The above 10 feature values, the two shape features (length-width ratio and circularity) and the two gray features (gray mean and gray variance) together form the feature vector used to identify the defects.
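For illustration, the feature vector of Section 3.3 could be assembled roughly as below. This is a sketch only: OpenCV is an assumption (the paper names no library), and the abstraction that turns the seven Hu moments into the ten values R1-R10 is not specified, so a simple log scaling of the seven moments is used instead, giving 11 rather than 14 components.

    import cv2
    import numpy as np

    def defect_features(defect_mask, gray_image):
        # Largest contour of the segmented defect region (assumes one region).
        contours, _ = cv2.findContours(defect_mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        cnt = max(contours, key=cv2.contourArea)
        # Shape features: length-width ratio R/K of the fitted box, circularity L^2/A.
        (_, _), (w, h), _ = cv2.minAreaRect(cnt)
        aspect_ratio = max(w, h) / max(min(w, h), 1e-6)
        L = cv2.arcLength(cnt, True)
        A = max(cv2.contourArea(cnt), 1e-6)
        circularity = L * L / A
        # Gray features over the defect pixels: mean and variance.
        pixels = gray_image[defect_mask > 0].astype(np.float64)
        gray_mean, gray_var = pixels.mean(), pixels.var()
        # Hu invariant moments of the defect region, log-scaled so that their
        # magnitudes are comparable (a stand-in for the abstraction R1-R10).
        hu = cv2.HuMoments(cv2.moments(defect_mask.astype(np.uint8), True)).flatten()
        hu_scaled = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
        return np.hstack([[aspect_ratio, circularity, gray_mean, gray_var], hu_scaled])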
3.4 Defect recognition algorithm based on the firefly neural network

In this study, a combination of the firefly algorithm and the RBF neural network was adopted to identify the defect features. The flow chart of the algorithm is shown in Figure 2.

Figure 2: Flow chart of the firefly neural network.

Establishment of the RBF neural network: the input layer has 14 nodes and the output layer 3 nodes, the number of hidden-layer nodes is the same as in the input layer, and the output codes 001, 010 and 100 stand for stomata, cracks and slag inclusion, respectively.
(1) Weight and threshold optimization.
Parameter initialization: the volatilization (decay) rate of luciferin, the update rate of luciferin, the change rate of the neighbourhood range, the moving step length s, the threshold rs of the perceived range of a firefly, and the threshold nt of the number of neighbour fireflies are set.
Algorithm initialization: the current position of each firefly i is recorded; every firefly starts with the same luciferin value and the same decision radius.
Luciferin update: the luciferin of firefly i at step t is obtained from its decayed luciferin at step t-1 plus a term proportional to the fitness value of its current position.
Neighbour set: the neighbours of firefly i at step t are the fireflies that lie within its decision radius (measured by the Euclidean distance) and carry more luciferin than firefly i.
Movement: firefly i chooses a neighbour j with probability proportional to the luciferin difference and moves one step of length s towards it.
Decision-radius update: the decision radius is enlarged when the number of neighbours falls below the threshold nt and reduced when it exceeds it, without exceeding the perceived range rs.
(2) After each iteration it is checked whether the maximum number of iterations has been reached. If it has, the algorithm finishes and the optimal value is recorded; if not, the iteration continues.
(3) The above data are used as training samples for training and testing the neural network.
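The update rules listed above correspond to the glowworm-style variant of the firefly algorithm; one iteration might look like the sketch below. The fitness function (here assumed to score a candidate weight/threshold vector of the RBF network) and all parameter values are illustrative assumptions, not values taken from the paper.

    import numpy as np

    def firefly_iteration(pos, luciferin, radius, fitness,
                          rho=0.4, gamma=0.6, beta=0.08, s=0.03, rs=3.0, nt=5):
        # Luciferin update: decayed previous luciferin plus the fitness of the
        # current position.
        luciferin = (1 - rho) * luciferin + gamma * np.array([fitness(x) for x in pos])
        new_pos = pos.copy()
        n = len(pos)
        for i in range(n):
            d = np.linalg.norm(pos - pos[i], axis=1)
            # Neighbours: within the decision radius and carrying more luciferin.
            nbrs = [j for j in range(n)
                    if j != i and d[j] < radius[i] and luciferin[j] > luciferin[i]]
            if nbrs:
                # Pick a neighbour with probability proportional to the luciferin
                # difference and move one step of length s towards it.
                p = luciferin[nbrs] - luciferin[i]
                j = np.random.choice(nbrs, p=p / p.sum())
                step_dir = pos[j] - pos[i]
                new_pos[i] = pos[i] + s * step_dir / (np.linalg.norm(step_dir) + 1e-12)
            # Decision radius: grow when neighbours are scarce, shrink when crowded,
            # bounded by the perceived range rs.
            radius[i] = min(rs, max(0.0, radius[i] + beta * (nt - len(nbrs))))
        return new_pos, luciferin, radius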
4 Example analysis of defect workpiece

In order to verify the correctness of the method in this study, defect recognition was carried out on 100 industrial CT images, of size 512 x 512, of a solid rocket engine model. The material of the motor body was 30GrMnSiA, the length of the motor grain was 1000 mm, and its external diameter was 150 mm. Manual inspection found 80 defective images and 20 non-defective images, containing 142 defects in total.

4.1 Defect localization results

The defects in the industrial CT images were localized. The block size was set to one quarter of the image. The calculated distribution of the fractal dimension is shown in Figure 3.

Figure 3: The frequency distribution of the fractal dimension.

The fractal dimension threshold has a large influence on the accuracy of the defect marks. It can be noted from Figure 3 that image defects could be clearly localized when the fractal dimension was between 2 and 2.1, where the frequency was 0.85; therefore the fractal dimension threshold was set to 2.1. The defect localization results are shown in Figure 4. As Figure 4 shows, the algorithm used in this study locates the defect area of the workpiece accurately and facilitates the subsequent defect detection.

Figure 4: Defect localization results.

4.2 Defect segmentation results

One hundred industrial CT images (80 defective images and 20 non-defective images) were processed by the improved k-means clustering algorithm, and the results were compared with the results of manual judgment, as shown in Table 2.

Table 2: Defect segmentation results.
                                   Defective images    Non-defective images
Number of segmented defects              78                      1
Number of unsegmented defects             2                     19
Accuracy rate                          97.5%                    95%

As can be seen from Table 2, the algorithm successfully segmented 78 defective images, only one non-defective image was wrongly segmented, and the overall segmentation accuracy was high, indicating that the proposed segmentation algorithm is highly reliable.

4.3 Feature extraction results

The features of the defect images were extracted with the abstracted invariant moment method. Taking a stoma as an example, its feature values are shown in Table 3.

Table 3: Defect feature values of a stoma.
        Original image    Translated image    Image rotated clockwise by 90 degrees
R1        0.315687           0.314256              0.312456
R2        1.935621           1.935124              1.935214
R3        4.500254           4.502103              4.505321
R4        3.785214           3.782158              3.780215
R5        2.124521           2.125632              2.120325
R6        0.234665           0.239654              0.236589
R7        0.621453           0.621036              0.625879
R8        0.235462           0.231456              0.236587
R9        0.625471           0.620852              0.623168
R10       0.442123           0.441258              0.446852

These 10 feature values were extracted; together with the two shape features and the two gray features, a 14-dimensional feature vector was obtained for each defect.

4.4 Feature recognition results

The theoretical and actual output values of 10 defects identified in 8 images are shown in Table 4. The theoretical output values of the neural network should be 001 (crack), 010 (stomata) or 100 (slag inclusion), but the actual output always contains some error. The error tolerance was therefore set to 0.2: an actual output below 0.2 was rounded to 0 and an output above 0.8 was rounded to 1. Among the ten defects in Table 4, only the recognition of A7 was wrong.

Table 4: Comparison between the theoretical output and the actual output.
No.    Theoretical output    Actual output
A1     0 0 1                 0.02132 0.01253 0.91021
A2     0 0 1                 0.01023 0.02154 0.92521
A3     1 0 0                 0.89652 0.10235 0.02365
A4     0 1 0                 0.02158 0.95213 0.02157
A5     0 0 1                 0.01245 0.08521 0.94587
A6     0 1 0                 0.01852 0.95210 0.01658
A7     1 0 0                 0.89658 0.42011 0.02856
A8     0 1 0                 0.02145 0.90258 0.01589
A9     0 0 1                 0.12035 0.02157 0.96324
A10    1 0 0                 0.95462 0.02145 0.01856

The 142 defects in the 100 processed CT images were then recognized; 139 defects were correctly recognized and 3 were misjudged. The accuracy rate was 97.89%, which indicates that the defect recognition method in this study is highly reliable.
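The rounding rule used above (outputs below 0.2 rounded to 0, above 0.8 rounded to 1, anything in between treated as unreliable) can be sketched as follows; the code-to-defect mapping follows the one stated in this section, while the function name and the handling of undecided outputs are our own illustrative choices.

    import numpy as np

    def decode_output(y, low=0.2, high=0.8):
        # Round each of the three RBF outputs to 0 or 1; values between the
        # thresholds are left undecided and the sample is counted as misjudged.
        y = np.asarray(y, dtype=float)
        code = np.full(3, -1, dtype=int)
        code[y < low] = 0
        code[y > high] = 1
        if -1 in code:
            return None
        labels = {(0, 0, 1): "crack", (0, 1, 0): "stomata", (1, 0, 0): "slag inclusion"}
        return labels.get(tuple(code))

    # Example: the A7 output from Table 4 is rejected because its second
    # component (0.42011) falls between the thresholds.
    print(decode_output([0.89658, 0.42011, 0.02856]))   # -> None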
5 Discussion and conclusion

The internal defects of a workpiece can greatly affect its practicability and safety. With the growing emphasis on workpiece quality, internal defect detection technology has developed considerably. Common internal defect detection technologies include ultrasonic testing, laser holography, X-ray photography, etc. Industrial CT imaging is currently the most effective non-destructive testing technology [17], making it easier to identify defects, and defect recognition based on industrial CT images is a simple and efficient method. Before recognition, the defects in the image must be located and segmented. Defect localization obtains the position information of defects from CT images; available approaches include fractal, Gabor wavelet and statistical methods. Yang et al. [18] proposed a localization method based on a cubature Kalman smoothing filter which can effectively locate defects. In this study, the block fractal algorithm was selected for localization; taking an industrial CT image as an example, it was found that the algorithm can successfully locate the defective part. The improved k-means clustering algorithm was selected to segment the image; the experiments showed an accuracy of more than 95%, which provided a good foundation for the subsequent defect feature extraction and recognition. Hu invariant moment theory was adopted for defect feature extraction; the abstraction of the Hu invariant moments makes the features easier to use. Ten feature values were thus obtained, and together with the two shape features and the two gray features they form the 14-dimensional feature vector used for recognition. Numerous methods exist for feature recognition, such as artificial neural networks (ANN), support vector machines (SVM) and principal component analysis (PCA). In this study, the RBF neural network was selected to recognize the features, and its weights and thresholds were optimized with the firefly algorithm. The accuracy rate of 97.89% obtained in the experiments confirms the reliability of the proposed defect recognition algorithm.

Industrial CT imaging is one of the effective methods for the non-destructive testing of workpieces. In this study, based on industrial CT images, the image defects were located by the block fractal algorithm, the improved k-means clustering algorithm was used to segment the defect image, the abstracted Hu invariant moment algorithm was adopted for feature extraction, and finally the firefly algorithm and the RBF neural network were used for feature recognition.

6 Acknowledgement

This paper is based on the project on contour extraction from industrial computed tomography images supported by the Chongqing Municipal Education Committee.

7 References

[1] Alimohamadi H, Ahmadyfard A, Shojaee E (2010). Defect Detection in Textiles Using Morphological Analysis of Optimal Gabor Wavelet Filter Response. International Conference on Computer and Automation Engineering, IEEE, Bangkok, Thailand, pp. 26-30. https://doi.org/10.1109/ICCAE.2009.43
[2] Chen Z, Wang C, Hu Z, Xie S (2010). Dual-tree complex wavelet analysis and its application in defect detection of workpiece for cross wedge rolling. International Conference on Advanced Computer Theory and Engineering, IEEE, Chengdu, China, pp. 861-72. https://doi.org/10.1109/ICACTE.2010.5579227
[3] Sabeenian RS, Paramasivam ME (2010). Defect detection and identification in textile fabrics using Multi Resolution Combined Statistical and Spatial Frequency Method. Advance Computing Conference, IEEE, Patiala, India, pp. 162-166. https://doi.org/10.1109/IADCC.2010.5423017
[4] Liu BT, Hou DB, Liu BL, Zhao L, Huang PJ, Zhang GX (2014). Defect detection and identification in eddy current testing using subtractive clustering algorithm combined with RBFNN. Insight - Non-Destructive Testing and Condition Monitoring, 56(7), pp. 375-380. https://doi.org/10.1784/insi.2014.56.7.375
[5] Leng Y, Xiao Z, Geng L, Xi J (2018). Defect detection and classification of galvanized stamping parts based on fully convolution neural network. International Conference on Graphic and Image Processing, pp. 188.
[6] Zhang RF (2013). Study on CT Image Reconstruction Applications in Industry. Advanced Materials Research, pp. 55-58. https://doi.org/10.4028/www.scientific.net/AMR.675.55
[7] Carmignato S (2012). Accuracy of industrial computed tomography measurements: Experimental results from an international comparison. CIRP Annals - Manufacturing Technology, 61(1), pp. 491-494. https://doi.org/10.1016/j.cirp.2012.03.021
[8] Chen L (2011). The Application and Investigation about Industry CT Scan Technology in the Measure and Design about Complex Box. Applied Mechanics & Materials, pp. 319-322. https://doi.org/10.4028/www.scientific.net/AMM.86.319
[9] Samarawickrama YC, Wickramasinghe CD (2017). Matlab based automated surface defect detection system for ceramic tiles using image processing. Technology and Management, IEEE, pp. 34-39. https://doi.org/10.1109/NCTM.2017.7872824
[10] Jiang SQ, Luan CB, Man YE, Zhao YY (2017). Application of Industry CT in Large Complicated Casing Inspection. Nondestructive Testing, 39(2), pp. 18-21.
[11] Chang M, Xiao Y, Chen Z, Li L (2011). Preliminary study of rotary motion blurs in a novel industry CT imaging system. Nuclear Science Symposium and Medical Imaging Conference, IEEE, Valencia, Spain, pp. 1358-1361. https://doi.org/10.1109/NSSMIC.2011.6154616
[12] Zabler S, Fella C, Dietrich A, Nachtrab F, Salamon M, Voland V, Ebensperger T, Oeckl S, Hanke R, Uhlmann N (2012). High-resolution and high-speed CT in industry and research. SPIE Optical Engineering + Applications, San Diego, California, United States, pp. 850617. https://doi.org/10.1117/12.964588
[13] Mandelbrot BB, Wheeler JA (1998). The Fractal Geometry of Nature. American Journal of Physics, 51(4), pp. 468. https://doi.org/10.1119/1.13295
[14] Normand MD, Peleg M (1988). Evaluation of the 'blanket' algorithm for ruggedness assessment. Powder Technology, 54(4), pp. 255-259. https://doi.org/10.1016/0032-5910(88)80055-X
[15] Tang YH, Gong A, Wang C (2012). Automatic Generation Algorithms Based on Optimization Initial Population. Computer & Modernization, 562(1), pp. 131-147.
[16] Yan BJ, Zheng L, Wang KY (2001). Fast Target-Detecting Algorithm Based on Invariant Moment. Infrared Technology, 23(6), pp. 8-12.
[17] Xie L, Huang R, Gu N, Cao Z (2014). A novel defect detection and identification method in optical inspection. Neural Computing & Applications, 24(7-8), pp. 1953-1962. https://doi.org/10.1007/s00521-013-1442-7
[18] Yang L, Li H, Zhou F, Jin P (2015). The Pipeline Defect Location Technology Based on Cubature Kalman Smooth Filter. Chinese Journal of Sensors & Actuators, 28(4), pp. 591-597. https://doi.org/10.3969/j.issn.1004-1699.2015.04.023

JOŽEF STEFAN INSTITUTE

Jožef Stefan (1835-1893) was one of the most prominent physicists of the 19th century. Born to Slovene parents, he obtained his Ph.D. at Vienna University, where he was later Director of the Physics Institute, Vice-President of the Vienna Academy of Sciences and a member of several scientific institutions in Europe. Stefan explored many areas in hydrodynamics, optics, acoustics, electricity, magnetism and the kinetic theory of gases. Among other things, he originated the law that the total radiation from a black body is proportional to the 4th power of its absolute temperature, known as the Stefan-Boltzmann law.

The Jožef Stefan Institute (JSI) is the leading independent scientific research institution in Slovenia, covering a broad spectrum of fundamental and applied research in the fields of physics, chemistry and biochemistry, electronics and information science, nuclear science technology, energy research and environmental science.

The Jožef Stefan Institute (JSI) is a research organisation for pure and applied research in the natural sciences and technology. Both are closely interconnected in research departments composed of different task teams. Emphasis in basic research is given to the development and education of young scientists, while applied research and development serve for the transfer of advanced knowledge, contributing to the development of the national economy and society in general.

At present the Institute, with a total of about 900 staff, has 700 researchers, about 250 of whom are postgraduates, around 500 of whom have doctorates (Ph.D.), and around 200 of whom have permanent professorships or temporary teaching assignments at the Universities.

In view of its activities and status, the JSI plays the role of a national institute, complementing the role of the universities and bridging the gap between basic science and applications.

Research at the JSI includes the following major fields: physics; chemistry; electronics, informatics and computer sciences; biochemistry; ecology; reactor technology; applied mathematics.
Most of the activities are more or less closely connected to information sciences, in particular computer sciences, artificial intelligence, language and speech technologies, computer-aided design, computer architectures, biocybernetics and robotics, computer automation and control, professional electronics, digital communications and networks, and applied mathematics.

The Institute is located in Ljubljana, the capital of the independent state of Slovenia (or S♥nia). The capital today is considered a crossroad between East, West and Mediterranean Europe, offering excellent productive capabilities and solid business opportunities, with strong international connections. Ljubljana is connected to important centers such as Prague, Budapest, Vienna, Zagreb, Milan, Rome, Monaco, Nice, Bern and Munich, all within a radius of 600 km.

From the Jožef Stefan Institute, the Technology park "Ljubljana" has been proposed as part of the national strategy for technological development to foster synergies between research and industry, to promote joint ventures between university bodies, research institutes and innovative industry, to act as an incubator for high-tech initiatives and to accelerate the development cycle of innovative products.

Part of the Institute was reorganized into several high-tech units supported by and connected within the Technology park at the Jožef Stefan Institute, established as the beginning of a regional Technology park "Ljubljana". The project was developed at a particularly historical moment, characterized by the process of state reorganisation, privatisation and private initiative. The national Technology Park is a shareholding company hosting an independent venture-capital institution.

The promoters and operational entities of the project are the Republic of Slovenia, Ministry of Higher Education, Science and Technology and the Jožef Stefan Institute. The framework of the operation also includes the University of Ljubljana, the National Institute of Chemistry, the Institute for Electronics and Vacuum Technology and the Institute for Materials and Construction Research among others. In addition, the project is supported by the Ministry of the Economy, the National Chamber of Economy and the City of Ljubljana.

Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel.: +386 1 4773 900, Fax: +386 1 251 93 85
WWW: http://www.ijs.si
E-mail: matjaz.gams@ijs.si
Public relations: Polona Strnad

INFORMATICA
AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS

INVITATION, COOPERATION

Submissions and Refereeing

Please register as an author and submit a manuscript at: http://www.informatica.si. At least two referees outside the author's country will examine it, and they are invited to make as many remarks as possible from typing errors to global philosophical disagreements. The chosen editor will send the author the obtained reviews. If the paper is accepted, the editor will also send an email to the managing editor. The executive board will inform the author that the paper has been accepted, and the author will send the paper to the managing editor. The paper will be published within one year of receipt of email with the text in Informatica MS Word format or Informatica LaTeX format and figures in .eps format. Style and examples of papers can be obtained from http://www.informatica.si. Opinions, news, calls for conferences, calls for papers, etc. should be sent directly to the managing editor.
SUBSCRIPTION Please,completethe orderformandsendittoDr.DragoTorkar, Informatica, Institut Jožef Stefan, Jamova 39, 1000 Ljubljana, Slovenia. E-mail: drago.torkar@ijs.si Since 1977, Informatica has been a major Slovenian scientifc journal of computing and informatics, including telecommunica­tions, automation and other related areas. In its 16th year (more than twentyfour years ago) it became truly international, although it still remains connected to Central Europe. The basic aim of In­formatica is to impose intellectual values (science, engineering) in a distributed organisation. Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive com­munity; scientifc and educational as well as technical, commer­cial and industrial. Its basic aimis to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientifc papers ac-ceptedbyat leasttwo referees outsidethe author’scountry.Inad­dition, it contains information about conferences, opinions, criti­calexaminationsofexisting publicationsandnews. Finally,major practical achievements and innovations in the computer and infor­mation industry are presented through commercial publications as well as through independent evaluations. Editing and refereeing are distributed. Each editor can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author’s country. If new referees are appointed, their names will appear in the Refereeing Board. Informatica web edition is free of charge and accessible at http://www.informatica.si. Informatica print edition is free of charge for major scientifc, ed­ucational and governmental institutions. Others should subscribe. Informatica WWW: http://www.informatica.si/ Referees from 2008 on: A. Abraham,S. Abraham,R. Accornero,A. Adhikari,R. Ahmad,G. Alvarez,N. Anciaux,R. Arora,I.Awan,J. Azimi,C. Badica,Z. Balogh,S. Banerjee,G. Barbier,A. Baruzzo,B. Batagelj,T. Beaubouef,N. Beaulieu,M. ter Beek,P. Bellavista,K. Bilal,S. Bishop,J. Bodlaj,M. Bohanec,D. Bolme,Z. Bonikowski,B. Boškovi´c,M. Botta, P. Brazdil,J. Brest,J. Brichau,A. Brodnik,D. Brown,I. Bruha,M. Bruynooghe,W. Buntine, D.D. Burdescu,J. Buys,X. Cai,Y. Cai, J.C. Cano,T. Cao, J.-V. Capella-Hernández,N. Carver,M.Cavazza,R.Ceylan,A. Chebotko, I. Chekalov,J. Chen, L.-M. Cheng,G. Chiola,Y.-C. Chiou,I. Chorbev, S.R. Choudhary, S.S.M. Chow, K.R. Chowdhury,V. Christlein,W. Chu,L. Chung,M. Ciglariˇ c, J.-N. Colin,V. Cortellessa,J. Cui,P. Cui,Z. Cui,D. Cutting,A. Cuzzocrea,V. Cvjetkovic,J. Cypryjanski,L. ˇCerepnalkoski,I. ˇc,G. Daniele,G. Cehovin,D. ˇCosi´Danoy,M. Dash,S. Datt,A. Datta, M.-Y.Day,F. Debili, C.J. Debono,J. Dediˇc,P.Degano,A. Dekdouk,H. Demirel,B. Demoen,S. Dendamrongvit,T. Deng,A. Derezinska,J. Dezert,G. Dias,I. Dimitrovski,S. Dobrišek, Q. Dou, J. Doumen, E. Dovgan, B. Dragovich, D. Drajic, O. Drbohlav, M. Drole, J. Dujmovi´c, O. Ebers, J. Eder, S. Elaluf-Calderwood,E. Engstr,U. riza Erturk,A.Farago,C. Fei,L. Feng,Y.X. Feng,B. Filipiˇc,I. Fister,I. FisterJr.,D. Fišer,A. Flores,V.A.Fomichov,S.Forli,A. Freitas,J. Fridrich,S. Friedman,C.Fu,X.Fu,T. Fujimoto, G. Fung, S. Gabrielli, D. Galindo, A. Gambarara, M. Gams, M. Ganzha, J. Garbajosa, R. Gennari, G. Georgeson, N. Gligori´c, S. Goel, G.H. Gonnet, D.S. Goodsell, S. Gordillo, J. Gore, M. Grˇcar, M. Grgurovi´c, D. Grosse, Z.-H. Guan, D. Gubiani, M. Guid, C. Guo, B. Gupta, M. 
Gusev, M. Hahsler, Z. Haiping, A. Hameed, C. Hamzaçebi, Q.-L. Han,H. Hanping,T. Härder, J.N. Hatzopoulos,S. Hazelhurst,K. Hempstalk, J.M.G. Hidalgo,J. Hodgson, M. Holbl, M.P. Hong, G. Howells, M. Hu, J. Hyvärinen, D. Ienco, B. Ionescu, R. Irfan, N. Jaisankar, D. Jakobovic,K. Jassem,I.Jawhar,Y. Jia,T. Jin,I. Jureta,.. Juri´ciˇ´ c,S.K,S. Kalajdziski,Y. Kalantidis,B. Kaluža, D. Kanellopoulos,R. Kapoor,D. Karapetyan,A. Kassler, D.S. Katz,A.Kaveh, S.U. Khan,M. Khattak,V. Khomenko, E.S. Khorasani,I. Kitanovski,D.Kocev,J.Kocijan,J.Kollár,A.Kontostathis,P.Korošec,A. Koschmider, D.Košir, J.Kovaˇ c,A. Krajnc,M. Krevs,J. Krogstie,P. Krsek,M.Kubat,M.Kukar,A.Kulis, A.P.S. Kumar, H. Kwa´ snicka,W.K. Lai, C.-S. Laih, K.-Y. Lam,N. Landwehr,J. Lanir,A.Lavrov,M. Layouni,G. Leban, A.Lee,Y.-C.Lee,U.Legat,A. Leonardis,G.Li,G.-Z.Li,J.Li,X.Li,X.Li,Y.Li,Y.Li,S.Lian,L.Liao,C.Lim, J.-C.Lin,H.Liu,J.Liu,P.Liu,X.Liu,X.Liu,F.Logist,S.Loskovska,H.Lu,Z.Lu,X.Luo,M. Luštrek,I.V. Lyustig, S.A. Madani,M. Mahoney, S.U.R. Malik,Y. Marinakis,D. Marinciˇˇ c, J. Marques-Silva, A. Martin, D. Marwede, M. Matijaševi´ c,T. Matsui,L. McMillan,A. McPherson,A. McPherson,Z. Meng, M.C. Mihaescu,V. Milea,N. Min-Allah,E. Minisci,V. Miši´ c, A.-H. Mogos,P. Mohapatra, D.D. Monica,A. Montanari,A. Moroni,J. Mosegaard,M. Moškon,L.deM. Mourelle,H. Moustafa,M. Možina,M. Mrak,Y.Mu,J. Mula,D.Nagamalai, M.Di Natale,A.Navarra,P.Navrat,N. Nedjah,R. Nejabati,W.Ng,Z.Ni, E.S. Nielsen,O. Nouali,F.Novak,B. Novikov,P. Nurmi,D. Obrul,B. Oliboni,X.Pan,M.Panˇc, B.-K. cur,W.Pang, G.Papa, M.Paprzycki, M.ParaliˇPark,P.Patel,T.B. Pedersen,Z. Peng, R.G. Pensa,J. Perš,D. Petcu,B. Petelin,M. Petkovšek,D.Pevec,M. Piˇcan,M. Polo,V. Pomponiu,E. Popescu,D. Poshyvanyk,B. Potoˇ culin,R. Piltaver,E. Pirogova,V. Podpeˇcnik, R.J.Povinelli, S.R.M. Prasanna,K. Pripuži´c,G. Puppis,H. Qian,Y. Qian,L. Qiao,C. Qin,J. Que, J.-J. Quisquater,C. Rafe,S. Rahimi,V.Rajkovi ˇc,J. Ramaekers,J. Ramon,R.Ravnik,Y. Reddy,W. c, D. Rakovi´Reimche, H. Rezankova, D. Rispoli, B. Ristevski, B. Robiˇ c, J.A. Rodriguez-Aguilar,P. Rohatgi,W. Rossak,I. Rožanc,J. Rupnik, S.B. Sadkhan,K. Saeed,M. Saeki, K.S.M. Sahari,C. Sakharwade,E. Sakkopoulos,P. Sala, M.H. Samadzadeh, J.S. Sandhu,P. Scaglioso,V. Schau,W. Schempp,J. Seberry,A. Senanayake,M. Senobari, T.C. Seong,S. Shamala, c. shi,Z. Shi,L. Shiguo,N. Shilov, Z.-E.H. Slimane,F. Smith,H. Sneed,P. Sokolowski, T. Song, A. Soppera, A. Sorniotti, M. Stajdohar, L. Stanescu, D. Strnad, X. Sun, L. Šajn, R. Šenkeˇrík, M.R. Šikonja,J. Šilc,I. Škrjanc,T. Štajner,B. Šter,V. Štruc,H.Takizawa,C.Talcott,N.Tomasev,D.Torkar,S. Torrente,M.Trampuš,C.Tranoris,K.Trojacanec,M. Tschierschke,F.DeTurck,J.Twycross,N. Tziritas,W. Vanhoof,P.Vateekul, L.A.Vese,A.Visconti,B. Vlaoviˇc,V.Vojisavljevi´c,M.Vozalis,P. Vraˇcar,V. Vrani´c, C.-H. Wang,H.Wang,H.Wang,H.Wang,S.Wang, X.-F.Wang,X.Wang,Y.Wang,A.Wasilewska,S.Wenzel,V. Wickramasinghe,J.Wong,S. Wrobel,K. Wrona,B.Wu,L. Xiang,Y. Xiang,D. Xiao,F. Xie,L. Xie,Z. Xing,H. Yang,X.Yang, N.Y.Yen,C.Yong-Sheng, J.J.You,G.Yu,X. Zabulis,A. Zainal,A. Zamuda,M. Zand,Z. Zhang, Z. Zhao,D. Zheng,J. Zheng,X. Zheng, Z.-H. Zhou,F. Zhuang,A. Zimmermann, M.J. Zuo,B. Zupan,M. Zuqiang, B. Žalik, J. Žižka, Informatica An International Journal of Computing and Informatics Web edition of Informatica may be accessed at: http://www.informatica.si. Subscription Information Informatica (ISSN 0350-5596) is published four times a year in Spring, Summer, Autumn,andWinter(4 issuesperyear)bytheSloveneSociety Informatika, Litostrojska cesta54,1000 Ljubljana, Slovenia. 
The subscription rate for 2018 (Volume 42) is
– 60 EUR for institutions,
– 30 EUR for individuals, and
– 15 EUR for students.
Claims for missing issues will be honored free of charge within six months after the publication date of the issue.

Typesetting: Borut Žnidar. Printing: ABO grafika d.o.o., Ob železnici 16, 1000 Ljubljana.

Orders may be placed by email (drago.torkar@ijs.si), telephone (+386 1 4773 900) or fax (+386 1 251 93 85). The payment should be made to our bank account no.: 02083-0013014662 at NLB d.d., 1520 Ljubljana, Trg republike 2, Slovenija, IBAN no.: SI56020830013014662, SWIFT Code: LJBASI2X.

Informatica is published by Slovene Society Informatika (president Niko Schlamberger) in cooperation with the following societies (and contact persons):
Slovene Society for Pattern Recognition (Vitomir Štruc)
Slovenian Artificial Intelligence Society (Mitja Luštrek)
Cognitive Science Society (Olga Markič)
Slovenian Society of Mathematicians, Physicists and Astronomers (Marej Brešar)
Automatic Control Society of Slovenia (Nenad Muškinja)
Slovenian Association of Technical and Natural Sciences / Engineering Academy of Slovenia (Mark Pleško)
ACM Slovenia (Borut Žalik)

Informatica is financially supported by the Slovenian research agency from the Call for co-financing of scientific periodical publications.

Informatica is surveyed by: ACM Digital Library, Citeseer, COBISS, Compendex, Computer & Information Systems Abstracts, Computer Database, Computer Science Index, Current Mathematical Publications, DBLP Computer Science Bibliography, Directory of Open Access Journals, InfoTrac OneFile, Inspec, Linguistic and Language Behaviour Abstracts, Mathematical Reviews, MatSciNet, MatSci on SilverPlatter, Scopus, Zentralblatt Math

Volume 42, Number 3, September 2018, ISSN 0350-5596

IJCAI 2018 - Chinese Dominance Established / M. Gams
Introduction to the Special Issue on "SoICT 2017" / H.T.T. Binh, I. Ide / 291
Spectrum Utilization Efficiency of Elastic Optical Networks Utilizing Coarse Granular Routing / H.-C. Le, N.T. Dang / 293
Time-stamp Incremental Checkpointing and its Application for an Optimization of Execution Model to Improve Performance of CAPE / V.L. Tran, É. Renault, V.H. Ha, X.H. Do / 301
SHIOT: A Novel SDN-based Framework for the Heterogeneous Internet of Things / H.-A. Tran, D. Tran, L.-G. Nguyen, Q.-T. Ha, V. Tong, A. Mellouk / 313
USL: A Domain-Specific Language for Precise Specification of Use Cases and Its Transformations / C.T.M. Hue, D.D. Hanh, N.N. Binh, L.M. Duc / 325
Effective Deep Multi-source Multi-task Learning Frameworks for Smile Detection, Emotion Recognition and Gender Classification / D.V. Sang, L.T.B. Cuong / 345
Alignment-free Sequence Searching over Whole Genomes Using 3D Random Plot of Query DNA Sequences / D.-Y. Lee, H.-S. Tak, H.-H. Kim, H.-G. Cho / 357
End of Special Issue / Start of normal papers
Cancelable Fingerprint Features Using Chaff Points Encapsulation / M.S. Al-Tarawneh / 369
Using Semantic Perimeters with Ontologies to Evaluate the Semantic Similarity of Scientific Papers / S. Iltache, C. Comparot, M.S. Mohammed, P.J. Charrel / 375
Improved Local Search Based Approximation Algorithm for Hard Uniform Capacitated k-Median Problem / S. Grover, N. Gupta, A. Panchol / 401
Efficient Trajectory Data Privacy Protection Scheme Based on Laplace's Differential Privacy / K. Gu, L. Yang, Y. Liu, B. Yin / 407
A Hybrid Particle Swarm Optimization and Differential Evolution Based Test Data Generation Algorithm for Data-Flow Coverage Using Neighbourhood Search Strategy / S. Varshney, M. Mehrotra / 417
Qualitative and Quantitative Optimization for Dependability Analysis / L. Boucerredj, N. Debbache / 439
Bio-IR-M: A Multi-Paradigm Modelling for Bio-Inspired Multi-Agent Systems / D. Zeghida, D. Meslati, N. Bounour / 451
Empirical Study on the Optimization Strategy of Subject Metro Design Based on Virtual Reality / Z. Wu / 467
Defect Features Recognition in 3D Industrial CT Images / H. Jiang / 477

Informatica 42 (2018) Number 3, pp. 285–483