A Novel Roll-Back Mechanism for Performance Enhancement of Asynchronous Checkpointing and Recovery

Bidyut Gupta and Shahram Rahimi
Department of Computer Science, Southern Illinois University
Mail Code 4511, Carbondale, IL 62901-4511, USA
{bidyut, rahimi}@cs.siu.edu

Yixin Yang
Department of Biological Sciences, Emporia State University
Emporia, KS 66801, USA
yyang@emporia.edu

Keywords: asynchronous checkpointing, recovery, maximum consistent state

Received: May 26, 2006

In this paper, we present a high performance recovery algorithm for distributed systems in which checkpoints are taken asynchronously. It offers fast determination of the recent consistent global checkpoint (maximum consistent state) of a distributed system after the system recovers from a failure. The main feature of the proposed recovery algorithm is that it avoids to a good extent unnecessary comparisons of checkpoints while testing for their mutual consistency. The algorithm is executed simultaneously by all participating processes, which ensures its fast execution. Moreover, we have presented an enhancement of the proposed recovery idea to put a limit on the dynamically growing lengths of the data structures used. It further reduces the number of comparisons necessary to determine a recent consistent state, thereby further reducing the completion time of the recovery algorithm. Finally, it is shown that the proposed algorithm offers better performance compared to some related existing works that use asynchronous checkpointing.

Povzetek: Opisan je izboljšan postopek okrevanja v porazdeljenih sistemih.

1 Introduction

Checkpointing and rollback-recovery are well-known techniques for providing fault-tolerance in distributed systems [1]-[5]. The failures are basically transient in nature, such as hardware errors [1]. Typically, in distributed systems, all the sites save their local states, known as local checkpoints. All the local checkpoints, one from each site, collectively form a global checkpoint. A global checkpoint is consistent if no message is sent after a checkpoint of the set and received before another checkpoint of the set [2]-[4]; that is, each message recorded as received in a checkpoint should also be recorded as sent in another checkpoint. In this context, it may be mentioned that a message is called an orphan message if it is recorded as received in a checkpoint, but not recorded as sent in another checkpoint. The local checkpoints belonging to a consistent global checkpoint will be termed in the present work as globally consistent checkpoints (GCCs). After recovery from a failure, processes in a distributed computation restart their computation from a consistent global checkpoint/state (CGS) of the system, i.e. from their respective GCCs. It may be noted that a consistent global checkpoint of a system is termed as a recent or a maximum one if, after the system recovers from a failure, the number of events (states) rolled back at each processor is a minimum [6].

To determine consistent global checkpoints, two fundamental approaches have been reported in the literature [1]-[9]: the synchronous and the asynchronous approaches. In the synchronous approach, the processes involved coordinate their local checkpointing actions such that the set of all recent checkpoints in the system is guaranteed to be consistent. Although it simplifies recovery, it has the following disadvantages: (1) additional messages need to be exchanged by the checkpointing algorithm when it takes each checkpoint; (2) synchronization delay is introduced during normal operation [5]. In the asynchronous approach, processes take checkpoints independently without any synchronization among them. Therefore, it is the simplest form of taking checkpoints. However, because of the absence of synchronization there is no guarantee that a set of local checkpoints taken will be a consistent set of local checkpoints. That is, there may exist orphan messages between the local checkpoints. In order to get rid of the orphan messages while determining the GCCs, processes have to roll back. In such a situation, rolling back one process causes one or more other processes to roll back. This effect is known as the domino effect [5]. This is the main drawback of the asynchronous approach. So, a recovery algorithm has to search for the most recent consistent set of checkpoints before the system restarts its normal operation. Therefore, the recovery process is quite complex, while the checkpointing scheme is much simpler than in the synchronous approach.

2 Related Works

In this work, we have considered the asynchronous checkpointing approach because of its simplicity in taking checkpoints. So, in this section we state briefly the contributions of some noted related works. When processes take checkpoints independently, some or all of the checkpoints taken may be useless for the purpose of constructing consistent global checkpoints. A set of checkpoints can belong to the same consistent global snapshot if no zigzag path (Z-path) exists from a checkpoint to any other checkpoint [15]. In other words, absence of a Z-path means absence of any orphan message. A theoretical framework for characterizing quasi-synchronous algorithms has been presented in [12]. Quasi-synchronous checkpointing algorithms reduce the number of useless checkpoints by preventing the formation of noncausal Z-paths between checkpoints and advance the recovery line. "Advancement of the recovery line" is interpreted as follows: the more the recovery line is advanced, the less is the amount of computation to be redone by the processes after the system restarts its normal operation; in other words, the amount of rollback per process after the system recovers from failure is reduced. Depending on the degree to which noncausal Z-paths are prevented, quasi-synchronous checkpointing algorithms are classified into three classes [12]: Strictly Z-Path Free (SZPF), Z-Path Free (ZPF), and Z-Cycle Free (ZCF). Manivannan and Singhal [13] have presented a quasi-synchronous checkpointing algorithm which allows the processes to take checkpoints asynchronously and reduces the number of useless checkpoints by forcing processes to take additional checkpoints. In this checkpointing algorithm, each process maintains a counter which is periodically incremented, and the time period is the same in all the processes.
When a process takes a checkpoint, it assigns the current value of its counter as the sequence number for the checkpoint. Each message is equipped (i.e. piggybacked) with the sequence number of the current checkpoint. If the sequence number accompanying the message is greater than the sequence number of the current checkpoint of the process receiving the message, then the receiving process takes a checkpoint and assigns the sequence number received in the message as the sequence number to the new checkpoint and then processes the message. Quasi-synchronous checkpointing algorithm makes sure that none of the checkpoints taken lies on a Z-cycle in order to make all checkpoints useful. Asynchronous recovery algorithms are also presented in this paper based on the checkpointing algorithm. A failed process needs to roll back to its latest checkpoint and requests other processes to rollback to their consistent (latest) checkpoints. The work claims to be free from any domino effect. However, arguably this work is more of a synchronous approach than an asynchronous approach; partly because all processes have identical time periods to take checkpoints, and checkpoint sequence numbers are used so that all the i'h checkpoints of all processes are taken at the same time (i.e., logically at same time). Hence, we argue that there is no question of domino effect as this work is not at all an asynchronous approach. Gupta et al. [11] have proposed a hybrid roll forward checkpointing/recovery approach. Processes take checkpoints periodically and these time periods are different for different processes. Periodically, in absence of any failure, an initiator process invokes the algorithm to advance the recovery line; the duration of this period is assumed to be much larger than the time period of any individual process. Therefore, the domino effect is limited by this time period. The main advantages of this work are that each process may need to keep at most two checkpoints at any time, processes participate in the algorithm simultaneously ensuring re-execution time after a failure is limited by the period of execution of the algorithm, and finally, recovery is as simple as in the synchronous checkpointing/recovery approach. Ohara et al. [14] proposed an uncoordinated checkpointing algorithm for finding a recovery line where a given checkpoint is the earliest. In this algorithm, each process maintains a set of all local checkpoints on that process in a local vector. All local checkpoints which are just behind a given checkpoint are initially assumed to form a consistent global checkpoint. The algorithm checks happened-before relation for any coupled local checkpoints belonging to an ordered global checkpoint set. If there exists any happened-before relation, it replaces a local checkpoint with a successive local checkpoint of the same process. The algorithm may end by either finding a recovery line or running out of local checkpoints to be replaced. Venkatesan and Juang [16] presented an asynchronous checkpointing algorithm where each process take checkpoints independently and keeps track of the number of messages it has sent to other processes as well as the number of messages it has received from other processes. The algorithm is initiated by the process which fails and is recovered from thereafter or when it learns about process failure. 
During its each iteration, a process needs to compare the number of messages received by it and the actual number of messages sent by the other process, at each of its checkpoint starting from the most recent one. The received vectors corresponding to all the checkpoints including the current one and the one where next iteration starts, need to be fetched from the storage in order to decide the checkpoint for the next iteration to start with. 3 System Model The distributed system has the following characteristics [1], [6], [10]: 1. Processes do not share memory and they communicate via messages sent through channels. 2. Channels are made virtually lossless and order of the messages is preserved by some end-to-end transmission protocol. 3. When a process fails, all other processes are notified of the failure in finite time. We also assume that no further processor (process) failures occur during the execution of the algorithm. In fact, the algorithm must be restarted if there are further failures. Below we state the problem considered in this work. Problem Formulation: In this work, we have considered asynchronous checkpointing approach because of its simplicity in taking checkpoints. That is, processes take checkpoints periodically and each process determines independently its time period of taking its checkpoints. So, different processes may have different time periods for taking their checkpoints. After the system recovers from a failure, processes start from the recent consistent state of the system. However, the main drawback of this approach is that determining a consistent global checkpoint may involve a very large number of pairwise comparisons of checkpoints belonging to different processes because of the presence of a possible domino effect. In absence of any hybrid approach [11], in the worst case, all checkpoints of all processes may have to be compared. However, asynchronous checkpointing approach is suitable for highly reliable systems where failures occur very seldom. In this work, our objective is to design an efficient recovery algorithm that will reduce considerably the number of unnecessary pairwise comparisons of checkpoints while determining a consistent global checkpoint. In other words, our objective is to identify a priori the checkpoints that can not be the GCCs so that we can exclude these checkpoints from comparison resulting in a fast determination of a recent consistent global checkpoint (state) of the system. Note that an initial version of this work has appeared in [17]. 4 Data Structures Let us assume that the distributed system under consideration consists of n processes. Each process Pi maintains a vectors Vi of length n. The Vi vector records the number of messages process Pi has sent to every other process with the exception that the element vi,i (=Vi (i)), i.e. the number of messages process Pi has sent to itself will be always zero. The Vi vector is described below: Vi = [vi,0, vi,1,.....,vi,i,.....,vi,n-1] where vij = Vi (j) and represents the number of messages sent by process Pi to process Pj, and vii is always zero. All entries in Vi are initialized to zero. Each time process Pi decides to send a message m to process Pj, then Vi(j) is incremented by one. This facilitates process Pi to know how many messages it has sent to process Pj. In this work, Cj,r represents the rth checkpoint taken by process Pj. Sometimes when mentioning the checkpoint number is irrelevant, we simply use Cj to denote a checkpoint taken by Pj. 
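As a concrete illustration of the send-count vector described above, the following minimal Python sketch shows how a process could maintain and checkpoint its Vi vector; the class and method names are ours, not taken from the paper.

```python
# Minimal sketch of the send-count vector V_i kept by each process P_i
# (class/method names are illustrative, not from the paper).

class SendVector:
    def __init__(self, n, i):
        self.i = i            # index of the owning process P_i
        self.v = [0] * n      # v[j] = messages P_i has sent to P_j; v[i] stays 0

    def on_send(self, j):
        """Called every time P_i hands a message for P_j to the channel."""
        assert j != self.i
        self.v[j] += 1

    def at_checkpoint(self):
        """Copy of the vector saved with the checkpoint taken at this moment."""
        return list(self.v)

# Example: P1 (index 0 in a 3-process system) sends one message to P2 (index 1)
# between its first and second checkpoints, as in Fig. 1.
v1 = SendVector(3, 0)
c11 = v1.at_checkpoint()      # [0, 0, 0] stored with C1,1
v1.on_send(1)
c12 = v1.at_checkpoint()      # [0, 1, 0] stored with C1,2
```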
Each process Pi also maintains a linear list Ri of dynamically growing length. At any given time t, the length of the list Ri (i.e. the number of entries in the list) is equal to the number of checkpoints taken by Pi till time t. For example, the length of the list is 3 at the 3rd checkpoint of process Pi, whereas its length will be 4 at its 4th checkpoint, and so on. The list Ri is described as Ri = [ri,1, ....., ri,r, .....], where ri,r = Ri(r) and represents the number of messages received by process Pi from all other processes till its rth checkpoint. Each such list is initially empty. Each process stores its vectors and lists together with the corresponding checkpoints in stable storage. Copies of the lists and the vectors are also stored in the respective local memories of the processors running the processes, which offers faster access than fetching them from stable storage whenever possible. In addition, each process maintains a Boolean flag. This flag is used to convey some specific information (described later).

5 Observations

Consider the system of three processes P1, P2, and P3 as shown in Fig. 1. The vectors V1, V2, and V3 initially have all their entries set to zero. The lists R1, R2, and R3 are initially all empty. By the time process P1 takes its first checkpoint C1,1, it has not sent any message to P2 or P3. So its V1 vector is [000]. Also, process P1 received one message before it took its first checkpoint; so now the list R1 has one entry, i.e. R1 = [1]. By the time process P1 takes its second checkpoint C1,2, it has already sent one message to P2. So it increments V1(2) by 1 and the vector V1 is now [010]. Also, process P1 has not received any messages (from P2 or from P3) before it takes its second checkpoint. So the list R1 at C1,2 is [1,1]. In the same way, the vector and the list are updated at each checkpoint of each process. This example will be used later in this paper to illustrate the working principle of our proposed algorithm.

We assume that a process Pi, after recovery from its failure, acts as the initiator process, i.e., Pi is responsible for invoking the recovery algorithm. To start with, Pi sends a message requesting all Pj, 0 ≤ j ≤ n-1, j ≠ i, to send to it their respective Vj vectors corresponding to their latest checkpoints. Upon receiving the request, every process Pj sends its Vj to Pi. After receiving the vector Vj from all processes, the initiator process Pi forms a two-dimensional n x n array VN whose jth row is the vector Vj, 0 ≤ j ≤ n-1, i.e.

VN(j, p) = Vj(p) = vj,p, for 0 ≤ j, p ≤ n-1.

The initiator process then computes the column sums of VN to create the following vector:

VC = [vc0, vc1, ..., vcj, ..., vcn-1], where vcj = VC(j) = Σi VN(i, j),

the sum being taken over all rows i, i.e. vcj is the sum of the entries of the jth column of VN. Therefore, vcj represents the total number of messages sent to process Pj by all other processes as recorded in each sending process' latest checkpoint. The initiator process Pi then unicasts vcj (= VC(j)) to process Pj. After receiving vcj from Pi, each process Pj computes Dj = Rj(r) - vcj, assuming that the last checkpoint of process Pj is the rth checkpoint (Cj,r). The difference Dj (if > 0) gives the exact number of orphan messages received by a process Pj till its checkpoint Cj,r from all other processes in the system. The initiator process Pi also does a similar computation to determine the exact number of orphan messages (if any) it has received till its latest checkpoint Ci,r.
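The computation just described can be condensed into a few lines. The sketch below (plain Python, with the Fig. 1 data hard-coded only for illustration) stacks the latest Vj vectors, takes column sums, and derives each Dj.

```python
# Sketch of the bookkeeping described above: build V_N from the reported vectors,
# column-sum it into V_C, and let each process compare v_cj with the receive
# count R_j(r) stored in its latest checkpoint.

def column_sums(VN):
    n = len(VN)
    return [sum(VN[i][j] for i in range(n)) for j in range(n)]

def orphans(Rj_latest, vcj):
    # D_j = R_j(r) - v_cj; a positive value is the number of orphan messages.
    return Rj_latest - vcj

# First iteration on the Fig. 1 data (P1, P2, P3 mapped to indices 0, 1, 2):
VN = [[0, 1, 0],   # V1 at C1,5
      [1, 0, 0],   # V2 at C2,1
      [0, 2, 0]]   # V3 at C3,2
VC = column_sums(VN)            # [1, 3, 0]
print(orphans(5, VC[0]))        # R1(5) = 5 -> D1 = 4  (orphans at P1)
print(orphans(2, VC[1]))        # R2(1) = 2 -> D2 = -1 (no orphans at P2)
print(orphans(1, VC[2]))        # R3(2) = 1 -> D3 = 1  (one orphan at P3)
```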
Proof of this statement is given later.

Figure 1: Vectors (Vi) and lists (Ri) for i = 1, 2, and 3

Observe that for every process Pj, vcj and Rj(r) may not be identical, because some of the sent messages (recorded already by the sending processes at their respective latest checkpoints) may not have arrived yet at Pj (i.e. vcj > Rj(r)), or some of the received messages (by Pj) may not have been recorded at the latest checkpoints of some sending processes because these messages may have been sent after their latest checkpoints (i.e. vcj < Rj(r)).

Assume that the last checkpoint of process Pj is the rth checkpoint (Cj,r) and Dj is greater than zero (Dj > 0). The search in the list Rj is performed backwards, starting with its last component. Thus, we search the preceding entries of the list Rj from Rj(r) till the first Rj(m) such that Rj(r) - Rj(m) ≥ Dj (m < r). Then the checkpoints Cj,r, ..., Cj,m+1 are excluded from the consideration of GCC composition, i.e. these checkpoints will be skipped. So, now we start from the checkpoint Cj,m of process Pj. The vector Vj at checkpoint Cj,m along with the Boolean flag "1" is sent to the initiator process Pi for the computation of the next iteration. In the next iteration, if Dj is smaller than or equal to zero (Dj ≤ 0), which means that process Pj has not received any orphan message till the checkpoint Cj,r, process Pj will send the flag "0" to the initiator process Pi. The initiator process Pi will use the vector Vj at Cj,r for the computation of the next iteration. The initiator process Pi is also involved in a similar computation, like any other process Pj, to determine its appropriate vector Vi needed for the computation of the next iteration. This will be repeated until all processes send "0" flags to the initiator process Pi and Pi's own flag is also 0. Then the initiator process Pi will notify all processes to roll back to their respective latest checkpoints at which their corresponding flags have the value 0 each. Thus, this set of checkpoints is a globally consistent checkpoint (proof is given later).

The following observations are necessary for designing the recovery algorithm.

Lemma 1: Let Cj,r be the latest checkpoint of process Pj at time t. If Dj > 0, then process Pj has received a total of Dj orphan messages from other processes.

Proof: Rj(r) represents the total number of messages process Pj has received so far from all other processes till time t. Also, vcj represents the total number of messages sent by all other processes to Pj as recorded in their latest checkpoints. Therefore Dj > 0 means that at least some process Pi (i ≠ j) has sent some message(s) to Pj after taking its latest checkpoint. It also means that the sending processes have not yet been able to record these Dj messages. Since all such Dj messages have been received and recorded in Pj's latest checkpoint, but remain unrecorded by the sending processes, Pj has received Dj orphan messages from the rest of the processes with respect to the checkpoint Cj,r. ■

Lemma 2: If Dj ≤ 0, process Pj has not received any orphan message.

Proof: Dj = 0 means that the number of messages received by Pj is equal to the number of messages sent to Pj, and these sent (also received) messages have already been recorded by the sending processes in their latest checkpoints. Therefore the received messages cannot be orphan. Also, Dj < 0 means that the number of the messages received by Pj is less than the number of messages sent to it. Now vcj is the total number of messages sent by all other processes to Pj as recorded in the latest checkpoints of the sending processes. It means that all messages received by Pj have already been recorded by the senders. Hence none of such received messages can be an orphan. Hence the proof follows. ■

Lemma 3: Let Dj > 0 at the checkpoint Cj,r of process Pj and let m denote the largest integer that satisfies Rj(r) - Rj(m) ≥ Dj (m < r). Then none of the checkpoints Cj,r, Cj,r-1, ..., Cj,m+1 belongs to the set of the globally consistent checkpoints.

Proof: Because m is the largest integer that satisfies Rj(r) - Rj(m) ≥ Dj (m < r), the relation Rj(r) - Rj(i) < Dj holds for any i (m+1 ≤ i ≤ r). Moreover, according to Lemma 1, Pj has received exactly Dj orphan messages from all other processes. So there must be at least one orphan message received by process Pj before Cj,r, and the same is also true before every checkpoint between Cj,r and Cj,m. Hence, none of the checkpoints Cj,r, Cj,r-1, ..., Cj,m+1 can belong to the set of the globally consistent checkpoints. ■

Theorem 1: Given a set S* = {Cj,r} of n checkpoints, one from each Pj, 0 ≤ j ≤ n-1, if for every checkpoint Cj,r its corresponding Dj ≤ 0, then S* is the set of the globally consistent checkpoints.

Proof: Since Dj ≤ 0 for each process Pj (0 ≤ j ≤ n-1) at its checkpoint Cj,r ∈ S*, all messages received by any such process Pj have already been recorded as sent by the sending processes in their corresponding checkpoints. Hence, according to Lemma 2, none of the messages received by process Pj is an orphan message. This is true for all processes. Therefore, the system of n processes does not have any orphan messages with respect to the checkpoints of the set S*. Hence the set S* is the set of globally consistent checkpoints. ■

Before we present the algorithm formally, we give an illustration of its working principle using the example of Fig. 1.

An illustration: Suppose a failure 'f' occurs on the processor running the process P1. The process P1 that became faulty acts as the initiator after recovery from the failure. After the system recovers from the failure, to start with, the initiator process P1 broadcasts a request asking the other two processes P2 and P3 to send their respective vectors V2 and V3 corresponding to their latest checkpoints C2,1 and C3,2. In this example, the three latest checkpoints of processes P1, P2, and P3 before the failure occurs are C1,5, C2,1, and C3,2. The respective vectors V1, V2, and V3 at the three latest checkpoints are [010], [100] and [020]. After receiving all these vectors, P1 (it becomes the initiator after recovery from failure) forms a two-dimensional array VN. It is written below:

VN =
[ 0 1 0 ]
[ 1 0 0 ]
[ 0 2 0 ]

P1 creates the vector VC = [130] and unicasts vcj to each process Pj, for j = 1, 2, and 3. After receiving vcj, each process Pj computes Dj (= Rj(r) - vcj), assuming the last checkpoint of Pj is the rth checkpoint, to determine the total number of orphan messages (if any) it has received with respect to its latest checkpoint, and P1 does the same. The lists R1, R2, and R3 at the latest checkpoints (C1,5, C2,1, and C3,2) of processes P1, P2 and P3 are [1,1,2,4,5], [2] and [0,1] respectively. P1 finds that D1 = (5-1) = 4; so it has received 4 orphan messages.
It calculates the difference between R1(5) and R1(2) and finds that R1(5) - R1(2) = 4 = D1; so process P1 now considers the vector V1 (= [010]) at C1,2 along with a flag "1" for the computation of the next iteration. P2 finds that it has not received any orphan message because D2 = (2-3) < 0. So it sends the same vector [100] and a flag "0" to P1. Process P3 finds that D3 = (1-0) = 1; so it has received an orphan message. It calculates the difference between R3(2) and R3(1) and finds that R3(2) - R3(1) = 1 = D3; so process P3 now sends the vector V3 (= [010]) at C3,1 along with a flag "1" to P1 for the computation of the next iteration. In the second iteration, P1 forms the following two dimensional array. 0 1 0 Vn = 1 0 0 0 1 0 P1 creates the vector VC = [120] and unicasts vcj to process Pj, for j = 1, 2, and 3. P1 finds that it has not received any orphan message because at C1,2, its D1 = 1 - 1 = 0. So, it sets its flag to 0. P2 also finds that it has not received any orphan message because at C2,1, its D2 = 2 - 2 = 0; and it sends the flag "0" to P1. Similarly, P3 finds that it has not received any orphan message because at C3,1, its D3 ( = R3(1) - vc3) = 0 - 0 = 0, and it sends a flag "0" to P1. Thus, P1 receives flag 0 from each process including its own flag set to 0. It then notifies each process to rollback to the current checkpoints corresponding to these flags (= 0). At this time, none of the processes needs to roll back further and hence P1 terminates the algorithm. Thus the algorithm terminates after two iterations. Therefore the GCCs belonging to the maximum consistent state are C1,2, C2,1 , and C3,1. It may be noted that in each iteration we need to fetch only the latest Rj for each process Pj and some Vj vectors (not all) to determine the GCCs. In each iteration, the checkpoints that can not be the GCCs are identified and their vectors Vj are not fetched at all. That is, the presented approach will not repeat its operation unnecessarily for these vectors corresponding to these non-GCCs. It definitely makes the approach fast and efficient. Observe what happens if we do not consider the above idea to determine the GCCs. It is stated below. First, C1,5, C2,1, and C3,2 are considered and compared pairwise to determine if they are globally consistent. Since C1,5 and C3,2 are not, so in the next iteration C1,4, C2,1, and C3,1 are considered pairwise. But C1,4 cannot be a GCC. Therefore C1,3, C2,1, and C3,1 are now considered. But since C1,3 can not be a GCC, therefore C1,2, C2,1, and C3,1 are now considered. This time it is found that these three checkpoints are globally consistent. Therefore four iterations for pairwise comparisons of three checkpoints, one from each process, are needed to determine the GCCs as opposed to only two when the approach presented in this work is followed. It also means that the number of trips to the stable storage for fetching checkpoints can also be reduced to a good extent in the proposed approach. It definitely makes our algorithm fast. Moreover when processes take large number of checkpoints before a failure occurs, our approach may offer even much better performance from the viewpoint of a possible large reduction in the number of iterations (i.e. the number of trips to stable storage as well) to determine the GCCs. As a result, the recovery scheme also will be faster. 
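The skip decision itself is a one-pass backward scan of the R list. The following small helper (our own code, not from the paper) applies it to the two processes of Fig. 1 that detect orphan messages.

```python
# Sketch of the checkpoint-skipping rule: find the largest m (< r) with
# R_j(r) - R_j(m) >= D_j and resume the search from C_{j,m}.

def next_checkpoint(Rj, Dj):
    r = len(Rj)                        # index of the latest checkpoint (1-based)
    for m in range(r - 1, 0, -1):      # the first hit from the top is the largest m
        if Rj[r - 1] - Rj[m - 1] >= Dj:
            return m
    return 0                           # assumption: fall back to the initial state

print(next_checkpoint([1, 1, 2, 4, 5], 4))   # P1: skip C1,5..C1,3, continue from C1,2 -> 2
print(next_checkpoint([0, 1], 1))            # P3: skip C3,2, continue from C3,1       -> 1
```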
Besides, it is clear from the example that each process Pj simultaneously identifies the checkpoints that cannot be globally consistent and therefore these checkpoints should be skipped. This parallelism of the algorithm further enhances the speed of execution of the recovery approach. 6 Algorithm to Determine Globally Consistent Checkpoints In the following algorithm we assume that process Pi was faulty. So, it becomes the initiator of the recovery algorithm after it recovers from the failure. 6.1 Algorithm Recovery Input: Given the latest n checkpoints, one for each process Pj , 0 < j < n-1, for an n process system and the corresponding vectors Vj and lists Rj at these n checkpoints. Output: A set of globally consistent checkpoints (maximum consistent state of the system). The responsibilities of each participating process Pj and the initiator process Pi are stated in Fig. 2. Proof of Correctness: Each process Pj repeats its steps 1, 2, 3, and 4 to arrive at a checkpoint that has not recorded the receipt of any orphan message from the other processes (using the observations of Lemmas 1, 2, and 3). In other words, it identifies the checkpoints that can not belong to the set of the globally consistent checkpoints and skips them. This decision is taken by identifying a checkpoint Cj,m such that m is the largest integer that satisfies Rj(r) - Rj(m) > Dj (m < r). None of the checkpoints Cj,r, Cj,r-i, Cj, m+i can belong to the set of the globally consistent checkpoints and they are skipped. However, the initiator process Pi decides when to terminate the algorithm, i.e., when the checkpoints can become globally consistent. Process Pi checks to see if all processes send flags of 0, i.e. Dj < 0 for each process Pj. If so, the algorithm terminates according to Theorem 1. Note that the condition Dj < 0 must always occur during the execution of the algorithm. It may be observed that in the worst case, because of some typical communication pattern, the domino effect may force each process to restart from its initial state where for each process Pj we always have Dj = 0. Besides, since the algorithm starts with the latest checkpoints, the number of events (states) rolled back at each processor is a minimum. This is true because, in its Step 4 each process Pj skips only the checkpoints that are non GCCs. Thus the algorithm determines the maximum consistent state of the system as well. ■ 6.2 Advantages of the proposed approach The presented algorithm offers the following advantages. During its each iteration, each process Pj determines the checkpoints that can not be the GCCs. Therefore, the algorithm is able to avoid any unnecessary computations of VC corresponding to these non GCCs. The presented algorithm skips checkpoints that do not belong to the set of the globally consistent checkpoints; thus it avoids many unnecessary pairwise comparisons. It also means that the number of trips to the stable storage for fetching checkpoints can also be reduced to a good extent in the proposed approach. It definitely makes the algorithm fast and efficient. The simultaneous execution of the algorithm by all participating processes also contributes to the speed of execution of the algorithm. Besides, the algorithm can find the maximum number of checkpoints to be skipped by determining the largest integer m, which satisfies Rj(r) - Rj(m) > Dj. This guarantees significant reduction in the iterations of computation. 
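To make the control flow of Fig. 2 (reproduced below) easier to follow, here is a compact, centralized Python simulation of one recovery run. It assumes all checkpointed vectors and lists are visible in one address space, which is of course not the case in the distributed algorithm, where only flags and Vj vectors travel between the processes and the initiator; the data are those of Fig. 1, with the vectors at C1,3 and C1,4 filled in as placeholders since the paper does not list them.

```python
# Centralized simulation of Algorithm Recovery (a sketch; the real algorithm runs
# the per-process part of Fig. 2 in parallel at every P_j).

def recovery(V, R):
    """V[j][k-1]: vector V_j saved with the k-th checkpoint of P_j.
       R[j][k-1]: receive count R_j(k) saved with that checkpoint.
       Returns, for each process, the index of its globally consistent checkpoint.
       (The degenerate domino case of rolling back to the initial state is not modelled.)"""
    n = len(V)
    cur = [len(R[j]) for j in range(n)]              # start from the latest checkpoints
    while True:
        VC = [sum(V[i][cur[i] - 1][j] for i in range(n)) for j in range(n)]
        rolled_back = False
        for j in range(n):
            D = R[j][cur[j] - 1] - VC[j]             # D_j at the current working checkpoint
            if D > 0:
                m = max(cur[j] - 1, 1)               # largest m with R_j(r) - R_j(m) >= D_j
                while m > 1 and R[j][cur[j] - 1] - R[j][m - 1] < D:
                    m -= 1
                if m < cur[j]:
                    cur[j] = m                       # C_{j,r}, ..., C_{j,m+1} are skipped
                    rolled_back = True
        if not rolled_back:
            return cur                               # all flags are 0: maximum consistent state

# Fig. 1 data; the vectors at C1,3 and C1,4 are placeholders (never read in this run).
V = [[[0,0,0], [0,1,0], [0,1,0], [0,1,0], [0,1,0]],
     [[1,0,0]],
     [[0,1,0], [0,2,0]]]
R = [[1, 1, 2, 4, 5], [2], [0, 1]]
print(recovery(V, R))   # -> [2, 1, 1], i.e. the GCCs are C1,2, C2,1 and C3,1
```

The run terminates after two iterations, matching the illustration given above.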
Initiator process Pi:
Step 1: It asks every process Pj (j ≠ i) to send its Vj corresponding to its latest checkpoint Cj,r;
Step 2: It receives all Vj, 0 ≤ j ≤ n-1;
Step 3: It computes the vector VC from the received vectors;
Step 4: It unicasts vcj to each process Pj, j ≠ i;
Step 5: It computes Di by calculating (Ri(r) - vci);
  if Di > 0
    it searches the list Ri till it finds the largest integer m (< r) that satisfies Ri(r) - Ri(m) ≥ Di; then it sets its flag to 1 and considers Vi corresponding to its checkpoint Ci,m (i.e. Ci,r is replaced by Ci,m) for the next iteration; /* checkpoints Ci,r, Ci,r-1, ..., Ci,m+1 are skipped */
  else
    it sets its flag to 0 and considers Vi at Ci,r for the next iteration;
Step 6: It receives the flag and Vj from each process Pj;
  if flag = 0 for each process Pj, 0 ≤ j ≤ n-1
    Pi asks each process Pj to restart the application program from its last checkpoint corresponding to which Dj ≤ 0; Pi resets its vector Vi to zero and its list Ri to an empty list corresponding to its restarting checkpoint at which Di ≤ 0; it restarts computation; /* its responsibility associated with the algorithm is finished; the globally consistent checkpoints belonging to the maximum consistent state are determined */
  else
    control flows to Step 3;

Process Pj:
Step 1: Pj receives the request from Pi;
  if Pi has requested to restart
    Pj resets its vector Vj to zero and its list Rj to an empty list corresponding to its restarting checkpoint at which Dj ≤ 0; it restarts computation;
  else
    it sends Vj corresponding to its latest checkpoint Cj,r to the initiator process Pi;
Step 2: It receives vcj from Pi;
Step 3: It computes Dj by calculating (Rj(r) - vcj);
Step 4: if Dj > 0
    it searches the list Rj till it finds the largest integer m (< r) that satisfies Rj(r) - Rj(m) ≥ Dj; then it sends a flag of 1 and Vj to Pi corresponding to its checkpoint Cj,m (i.e. Cj,r is replaced by Cj,m); /* checkpoints Cj,r, Cj,r-1, ..., Cj,m+1 are skipped */
  else
    it sends a flag of 0 and Vj at Cj,r to the initiator process Pi;

Figure 2: The responsibilities of each participating process Pj and the initiator process Pi

6.3 Performance

Message complexity: Suppose the termination of the algorithm requires the construction of the vector VC by the initiator process Pi to occur k times (i.e. k iterations). During each such iteration every process in the n-process system exchanges a couple of messages with the initiator process Pi. Thus, O(n) messages are sufficient per iteration. Considering k iterations, the message complexity of the algorithm is O(kn).

Besides message complexity, another factor that must be considered as a performance measure is the number of pairwise comparisons of the checkpoints among the processes that needs to be performed by any asynchronous checkpointing/recovery approach. This is done in order to determine a consistent global state of the system. Obviously, the larger the number of such comparisons, the larger is the execution time of the recovery algorithm. This has been discussed in the previous subsection. It may be noted that the number of such pairwise comparisons is also related to the number of times checkpoints are fetched from stable storage, i.e. the number of trips to the storage. The time spent on such trips may be substantial enough to affect to a good extent the speed of execution of any recovery algorithm. One possible solution may be to fetch a large number of checkpoints at a time. However, it may not be a good idea in many situations; for example, a process may end up fetching too many checkpoints when that many are not needed, or too few when more are needed. So, it becomes quite arbitrary how many checkpoints should be fetched at a time. Therefore, it is reasonable to assume that a process will fetch one checkpoint at a time and, in fact, this is true for all existing asynchronous checkpointing/recovery algorithms.

In the following analysis we consider the fact that the larger the number of pairwise comparisons of checkpoints, the larger is the number of trips to stable storage, and therefore the larger is the resulting execution time. In our analysis we will not consider complexity due to message size, as most related works including ours use control messages of reasonably small size, and all these works differ mainly in terms of the number of comparisons, the number of iterations, and the number of control messages needed to determine a consistent global state. It may be noted that computing this number of comparisons is not very straightforward because it depends solely on the nature of the distributed computations. However, we give an approximate analysis which may not be very accurate; still, it will offer a clear understanding of the advantages of our algorithm over some other noted asynchronous checkpointing/recovery approaches [14], [16]. It is stated below.

Let the system consist of n processes. For simplicity we assume that after a failure occurs and the system recovers from it, each process will skip on an average its latest (r-1) checkpoints to restart its computation. Thus a process Pj will skip its latest (r-1) checkpoints Cj,m+2, ..., Cj,m+r. We also assume that the set {C0,m+1, C1,m+1, ..., Cn-1,m+1} represents the globally consistent checkpoint (maximum consistent state) of the system and that our algorithm will determine it in k iterations. In this simple model, we consider a recovery approach associated with an asynchronous checkpointing scheme in which the pairwise comparisons to determine checkpoints' consistency involve first the checkpoints of the set {C0,m+r, ..., Cn-1,m+r}, followed by the set {C0,m+r-1, ..., Cn-1,m+r-1}, and so on, and finally the set {C0,m+1, ..., Cn-1,m+1}, which is the globally consistent state. Therefore, the total number of comparisons is given by [r x {n(n-1)}/2]. Note that this may not be the exact way to perform the comparisons in a particular case; still, it offers a clear view of how complex it can be. In general, a checkpoint in one set may also have to be compared with a checkpoint in another set. On the other hand, not necessarily all checkpoints in a set may need to be pairwise compared. It depends on the nature of the distributed computations. So the actual number of comparisons may be larger or smaller than the number [r x {n(n-1)}/2]. Anyway, it is clear that this number is much larger than the total number of comparisons, k x n, offered by our approach, where n is the number of parallel comparisons to test if Dj > 0 in each iteration and 1 ≤ k ≤ r. Observe that in the worst case, the number of comparisons of the proposed approach may become [r x {n(n-1)}/2].
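The two counts contrasted above are easy to tabulate; the short snippet below simply evaluates both formulas for a few illustrative values of n, r and k (the numbers are ours, chosen only to show the gap).

```python
# r * n(n-1)/2 pairwise comparisons (exhaustive model) versus k * n tests (proposed).

def exhaustive(n, r):
    return r * n * (n - 1) // 2

def proposed(n, k):
    return k * n

for n, r, k in [(5, 10, 2), (10, 20, 3), (20, 50, 5)]:   # illustrative values only
    print(n, r, k, exhaustive(n, r), proposed(n, k))
# e.g. n = 10, r = 20, k = 3 gives 900 comparisons versus 30.
```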
Below we have compared the performance of our approach with the approaches in [14], [16].

6.3.1 Comparison with Ohara et al. [14]

Ohara et al. [14] have proposed an asynchronous approach for finding a recovery line where a given checkpoint is the earliest. All the local checkpoints which are just behind a given checkpoint are initially assumed to form a consistent global checkpoint. In this algorithm, happened-before relations are checked for every coupled local checkpoints belonging to an ordered global checkpoint set. If there exists any happened-before relation, it replaces a local checkpoint with a successive local checkpoint of the same process. The algorithm may end by either finding a recovery line or running out of local checkpoints to be replaced. This leads to exhaustive comparisons of happened-before relations for every coupled local checkpoints. The number of such comparisons is approximately [r x {n(n-1)}/2], as calculated earlier. Our algorithm, in contrast, skips the checkpoints that do not belong to the set of the globally consistent checkpoints. Thus, our algorithm reduces to a good extent the unnecessary pairwise comparisons of checkpoints needed to determine a globally consistent checkpoint of the system.

Performance comparison of the above mentioned approach [14] and our approach is shown in Fig. 4. Fig. 3 illustrates how the number of comparisons is affected by the increase in the average number of checkpoints per process (r) in the asynchronous approach [14] and in our approach. Fig. 4 shows the variation of the number of comparisons with the increase in the number of processes (n). Both figures highlight the advantage offered by our approach, i.e. a considerable reduction in the number of comparisons. It helps the processes to restart their computation related to the distributed application much faster after the system recovers from a failure.

Figure 3: Number of comparisons vs. the average number of checkpoints per process (r).

6.3.2 Comparison with Venkatesan et al. [16]

Venkatesan and Juang [16] presented an asynchronous checkpointing algorithm where each process takes checkpoints independently and keeps track of the number of messages it has sent to other processes as well as the number of messages it has received from other processes. The existence of orphan messages is discovered by comparing the number of messages sent and received. The algorithm is initiated by the process when a failure occurs or when it learns about a process failure.

Figure 4: Number of comparisons vs. the number of processes (n).

During each of its iterations, a process needs to compare the number of messages received by it and the actual number of messages sent by the other process, at each of its checkpoints starting from the recent one. The received vectors corresponding to all the checkpoints, including the current one and the one where the next iteration should start, need to be fetched from the storage in order to decide the checkpoint for the next iteration to start with. It means that the number of trips to the storage for fetching the information related to the received messages (for the purpose of comparison) will be equal to the number of checkpoints starting from the current checkpoint all the way to the checkpoint where the next iteration should start.

Figure 5: Number of control messages vs. the number of processes (n).
In our algorithm, the decision about the checkpoint at which the next iteration should start is based on the R-vector at the recent checkpoint only. This algorithm skips checkpoints that do not belong to the set of the globally consistent checkpoints by examining this R-vector only. Therefore, in order to determine the checkpoint for the next iteration to start with, the number of trips to the storage is only one per iteration. This means that the total number of trips to complete the execution of our algorithm is reduced to a good extent compared to that in [16]. We now compare the two algorithms based on the number of control messages needed to execute the respective algorithms. In [16], in each iteration, for an n-process system n(n-1) messages are exchanged among the processes. Thus, O(n2) messages are exchanged in each iteration. In our algorithm, 3(n-1) messages are exchanged in each iteration. Thus, O(n) messages are sufficient in each iteration in our algorithm where n is the number of processes in the system. Fig. 5 shows the message complexity comparison of the two algorithms with the increase in the number of processes. This figure clearly shows the advantage offered by our algorithm over the one in [16]. 7 Further Enhancement We have seen that the linear list Rj maintained by a process Pj increases dynamically. If the application program has large execution time and there is seldom any failure during its execution, the length of the lists may become too large; thereby it may consume considerable amount of memory. To solve this problem, i.e. to keep the list from growing too much we will propose a simple solution in this section. The following operation is needed in the implementation of the idea. We define the subtraction operation on two vectors Vj of process Pj at its two checkpoints Cj,m and Cj,s with (s > m) as follows: Vj at Cj,s - Vj at Cj,m = [(vj,0 at Cj,s - vj,0 at Cj,m), ^ , (vj, n-1 at Cj,s - vj, n-1 at Cj,m)] = [(vj,p at Cj,s - vj,p at Cj,m)] for 0 < p < n-1 We now state the basic idea to keep the growing lengths of the lists in control. This idea has been used in designing the enhanced recovery algorithm stated later in this section. It may be noted that the recovery algorithm stated earlier does not consider the use of this idea. In absence of any failure an algorithm runs periodically (say the time period is T which should be much larger than the time period of any individual process) to put a limit on the length of the R-vector. The lengths of the lists (R-vectors) may then be limited by the number of checkpoints taken by the processes during the time interval (T) between two successive executions of the algorithm. Besides in doing so, this also advances the recovery line in the event that a recent recovery line exists other than the one found during the previous execution of the algorithm. In effect, the number of comparisons of the checkpoints to determine a recent consistent state may also drastically reduce since there is a possibility that the algorithm will consider in a particular run only the checkpoints which the processes take during the interval T. Therefore, this enhanced algorithm, in general, may take much less time to complete its execution compared to Algorithm Recovery. 
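The subtraction operation defined earlier in this section is just a component-wise difference; a one-line helper such as the following (our notation) is all that the enhancement needs.

```python
# Component-wise subtraction of two V_j vectors: V_j at C_{j,s} minus V_j at C_{j,m}.

def v_subtract(v_at_s, v_at_m):
    return [a - b for a, b in zip(v_at_s, v_at_m)]

# Example with the Fig. 1 values of P3: V3 at C3,2 = [0,2,0], V3 at C3,1 = [0,1,0].
print(v_subtract([0, 2, 0], [0, 1, 0]))   # [0, 1, 0]
```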
Also note that at the completion of the l'h execution of the algorithm a process Pj will have in stable storage only its recent globally consistent checkpoint, say Cj,m and any other checkpoint (s) it has taken thereafter and prior to the start of the l'h periodic execution of the algorithm. In describing the following two rules for updating the lists Rj and the vectors Vj of a process Pj we have assumed that the latest globally consistent checkpoint of process Pj is Cj,m as determined by the l'h execution of the algorithm and it has taken (k-m) more checkpoints thereafter and prior to the start of the lth periodic execution of the algorithm. Rule 1: Updated Rj at Cj,m = {} and updated Vj at Cj,m = [00_0] Rule 2: Updated Rj at Cj s for (m+1 < s < k) = [(Rj(m+1) - Rj(m)), ^ , (Rj(s) - Rj(m))], and Updated Vj at each Cj,s = [(vj,p at Cj,s - vj,p at Cj,m)] for 0 < p < n-1 When we implement the above idea of reducing the lengths of the lists, either of the following two approaches can be adopted: Approach 1: When a failure occurs and the system recovers from the failure, the algorithm is run again in spite of its periodic execution, with the hope that a recent (maximum) consistent state may be found which is not identical to the one determined by its last periodical execution. In such a situation the time to complete the application will be less because of the advancement of the recovery line. On the other hand, if such a situation as mentioned above does not exist, the algorithm will output the same consistent state as determined in its last periodic execution. In this case, however, the application will take an additional amount of time equal to the execution time of the algorithm for its completion. Approach 2: After the system recovers from a failure all processes restart from their respective globally consistent checkpoints which have already been determined by the algorithm's last periodic execution prior to the occurrence of the failure. The recovery becomes as simple as that in a synchronous approach. However, since this approach does not look for the possible existence of a recent consistent state other than the already existing one, therefore the time to complete the application may increase. Observe that irrespective of which approach is followed, the next periodic execution of the algorithm will occur T time units after the system restarts. About when to apply a specific rule, Rules 1 and 2 will be implemented when the algorithm runs periodically in absence of any failure. Rule 1 is also implemented when determination of a consistent global state of the system is needed after the system recovers from a failure (Approach 1). In the following algorithm we have considered a combination of the two approaches. For the selection of an initiator process for running the algorithm periodically, we consider that each process Pi maintains a local CLKi variable which is incremented at periodic time interval T. It also maintains a local counter denoted as counteri , initially set to 0 and is incremented by process Pi during its turn to initiate the recovery algorithm. Thus, a process on its own determines if it is its turn to initiate the execution of the algorithm. In this context, observe that the set of GCCs is unique and is independent of the initiator process. 
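To make Rules 1 and 2 above concrete, the sketch below rebases the stored lists and vectors of a process on its latest GCC. The modelling is ours (each checkpoint of Pj is an entry in two dictionaries keyed by checkpoint number), and the vectors at C1,3-C1,5 are assumed, since Fig. 1 does not list them.

```python
# Sketch of Rules 1 and 2: re-base the R list and V vectors of P_j on its
# latest globally consistent checkpoint C_{j,m}.

def apply_rules(R, V, m):
    """R[k]: value R_j(k); V[k]: vector V_j at C_{j,k} (k is the checkpoint number).
       Returns the updated lists/vectors for C_{j,m} and every later checkpoint."""
    zero = [0] * len(V[m])
    new_R = {m: []}                            # Rule 1: list at C_{j,m} becomes empty
    new_V = {m: zero}                          # Rule 1: vector at C_{j,m} becomes all zeros
    for s in range(m + 1, max(R) + 1):         # Rule 2: later checkpoints re-based on C_{j,m}
        new_R[s] = [R[t] - R[m] for t in range(m + 1, s + 1)]
        new_V[s] = [a - b for a, b in zip(V[s], V[m])]
    return new_R, new_V

# P1 of Fig. 1 with GCC C1,2 (vectors at C1,3..C1,5 are assumed, not given in the paper):
R1 = {1: 1, 2: 1, 3: 2, 4: 4, 5: 5}
V1 = {1: [0, 0, 0], 2: [0, 1, 0], 3: [0, 1, 0], 4: [0, 1, 0], 5: [0, 1, 0]}
new_R1, new_V1 = apply_rules(R1, V1, 2)
print(new_R1)   # {2: [], 3: [1], 4: [1, 3], 5: [1, 3, 4]}
```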
We state below how a process Pi does this before we formally state the algorithm.

Selection of an initiator process: At each process Pi (0 ≤ i ≤ n-1):
if CLKi = (i + (counteri * n)) * T
  counteri = counteri + 1; /* it is Pi's turn to initiate the recovery algorithm, i.e., Pi becomes the initiator */

Algorithm Recovery - Enhanced:

Input: Given the latest n checkpoints, one for each process Pj, 0 ≤ j ≤ n-1, for an n-process system and the corresponding vectors Vj and lists Rj at these n checkpoints.
Output: A set of globally consistent checkpoints (maximum consistent state of the system).

The responsibilities of the initiator process Pi and each participating process Pj are stated in Fig. 6.

An example: Consider the system as shown in Fig. 7. Ignore the presence of the failure 'f' for the time being. Suppose that the periodic execution of the algorithm starts immediately after processes P1 and P3 take their respective checkpoints C1,5 and C3,2. The algorithm determines the latest consistent global checkpoint of the system. It is {C1,2, C2,1, C3,1}. The two rules are applied to update the lists R1, R2, and R3, and the vectors V1, V2, and V3 at the checkpoints of processes P1, P2, and P3 starting from their respective latest globally consistent checkpoints, namely C1,2, C2,1, and C3,1. The system with the updated lists and vectors is shown in Fig. 8. The checkpoints shown in Fig. 8 are the only ones saved in stable storage. Now assume that a failure 'f' has occurred. Therefore the algorithm determines the consistent global checkpoint of the system, which is {C1,2, C2,1, C3,1}, and applies only Rule 1 to reset the vectors to zero and to make the lists empty at the respective GCCs of the three processes.

Figure 7: Before the execution of the algorithm

The system in this situation is shown in Fig. 9. The three respective consistent checkpoints are the only ones saved in the stable storage at this time. Note that the consistent global state remains the same (see Figs. 8 and 9). This is the situation in which the time to complete the application program increases by an amount equal to the time to execute the recovery algorithm. This has been pointed out earlier in the description of Approach 1. However, this will not happen if only Approach 2 is followed for recovery.

Figure 8: After the execution of the algorithm in absence of any failure

Figure 9: The system restarts from its consistent global state {C1,2, C2,1, C3,1} after recovery.

7.1 Comparison with [11] and [13]

Gupta et al. [11] have proposed a roll-forward hybrid checkpointing/recovery scheme using basic checkpoints. The direct dependency concept used in the communication-induced checkpointing scheme has been applied to basic checkpoints to design a simple algorithm to find a consistent global checkpoint. They have used the concept of forced checkpoints, which ensures a small re-execution time after recovery from a failure. This scheme has the advantages of simple recovery as in the synchronous approach and a simple way to create checkpoints as in the asynchronous approach. Our proposed approach (enhanced version) is not a hybrid approach. It runs periodically only to put a limit on the size of the R-vectors. This is the primary objective of the enhanced approach. In doing so it may come out with a recent recovery line that is different from the one found during the last execution of the algorithm.
Thus, effectively as mentioned earlier, even though the proposed algorithm is not a hybrid one, still as in [11] it may reduce drastically the number of comparisons needed to identify a recovery line, as well as it may limit the domino effect by the time period T, based on the message communication pattern among the processes. Our proposed approach is quite different from the work in [13] in that in our approach processes take checkpoints completely independently based on their individual time periods that are different for different processes. In [13], processes take checkpoints with the same time periods and they make sure that there is no orphan message between any two ith checkpoints of two processes. Therefore, it is more of a synchronous approach than an asynchronous approach, where as our approach is purely an asynchronous approach. 8 Conclusions In this paper we have presented an efficient recovery algorithm for distributed systems. Asynchronous checkpointing scheme has been considered because of its simplicity in taking checkpoints. The main feature of the recovery algorithm is that to determine a maximum consistent state, the algorithm in its each iteration does not need to compare all the vectors at all the checkpoints of the processes. In its each iteration the algorithm identifies and skips those checkpoints that can not belong to the set of the globally consistent checkpoints. It not only reduces the computational overhead to a good extent, but also the number of trips to the stable storage for fetching checkpoints is reduced compared to the works in [14] and [16], and as a result its execution becomes even faster. In this context, it may be noted that in any algorithm that uses asynchronous checkpointing, there is always some computational time wasted to create process checkpoints that later do not belong to CGS and this problem can not be avoided. This is true for our proposed algorithms as well. Besides, it is executed simultaneously by all participating processes while determining a maximum consistent state. It further contributes to its speed of execution. We have also proposed a simple enhanced asynchronous recovery scheme to control the dynamically growing length of the lists. In effect, the number of comparisons of the checkpoints to determine a recent consistent state may also drastically reduce and based on the communication pattern among the processes it may limit the domino effect by the time period T. Even though we do not apply any hybrid checkpointing scheme [11], still this approach offers the option to achieve a recovery scheme which is as simple as the approach proposed in [11]. In this context, it may be noted that if the system model changes such that order of the messages sent through the channel cannot be preserved, it will adversely affect the processing time, because a process must wait to receive message m1 before processing its already received message m2. Here, we have assumed that the proper order is m, followed by m2. Our future work is directed at the new challenging area of designing recovery schemes for cluster federation computing environment in which different clusters may adopt different ways for checkpointing, for example, some may apply coordinated approach, where as other may apply asynchronous approach [18 ], [19 ]. [tiÌEÌal[>r nrnces.11'.: Step I : i r I hi; al^uiil hm i s cxccutcd un failure recovciyonfailurc = 1; disi? 
Figure 6: The responsibilities of the initiator process Pi and each participating process Pj for the enhanced algorithm.
9 References
[1] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems", IEEE Trans. Software Engineering, vol. SE-13, no. 1, pp. 23-31, Jan. 1987.
[2] Y. M. Wang, A. Lowry, and W. K. Fuchs, "Consistent Global Checkpoints Based on Direct Dependency Tracking", Information Processing Letters, vol. 50, no. 4, pp. 223-230, May 1994.
[3] K. M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems", ACM Trans. Computing Systems, vol. 3, no. 1, pp. 63-75, Feb. 1985.
[4] Y. Wang, "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints", IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, April 1997.
[5] M. Singhal and N. G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994.
[6] S. Venkatesan, T. Juang, and S. Alagar, "Optimistic Crash Recovery Without Changing Application Messages", IEEE Trans. Parallel and Distributed Systems, vol.
8, no. 3, pp. 263-271, March 1997.
[7] R. Baldoni, F. Quaglia, and P. Fornara, "An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems", IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 2, pp. 181-192, Feb. 1999.
[8] J. Tsai, S.-Y. Kuo, and Y.-M. Wang, "Theoretical Analysis for Communication-Induced Checkpointing Protocols with Rollback Dependency Trackability", IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 963-971, Oct. 1998.
[9] G. Cao and M. Singhal, "On Coordinated Checkpointing in Distributed Systems", IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 12, pp. 1213-1225, Dec. 1998.
[10] P. Jalote, Fault Tolerance in Distributed Systems, PTR Prentice Hall, 1994; Addison-Wesley, 1998.
[11] B. Gupta, S. K. Banerjee and B. Liu, "Design of new roll-forward recovery approach for distributed systems", IEE Proc. Computers and Digital Techniques, vol. 149, no. 3, pp. 105-112, May 2002.
[12] D. Manivannan and M. Singhal, "Quasi-synchronous checkpointing: Models, characterization, and classification", IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 7, pp. 703-713, July 1999.
[13] D. Manivannan and M. Singhal, "Asynchronous recovery without using vector timestamps", Journal of Parallel and Distributed Computing, vol. 62, pp. 1695-1728, Dec. 2002.
[14] M. Ohara, M. Arai, S. Fukumoto, and K. Iwasaki, "Finding a Recovery Line in Uncoordinated Checkpointing", Proceedings 24th International Conference on Distributed Computing Systems Workshops (ICDCSW'04), pp. 628-633, 2004.
[15] R. H. B. Netzer and J. Xu, "Necessary and Sufficient Conditions for Consistent Global Snapshots", IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 2, pp. 165-169, Feb. 1995.
[16] T. Juang and S. Venkatesan, "Crash Recovery with Little Overhead", Proc. 11th International Conference on Distributed Computing Systems, pp. 454-461, May 1991.
[17] B. Gupta, Y. Yang, S. Rahimi, and A. Vemuri, "A High-Performance Recovery Algorithm for Distributed Systems", Proc. 21st International Conference on Computers and Their Applications, pp. 283-288, Seattle, March 2006.
[18] S. Monnet, C. Morin, R. Badrinath, "Hybrid Checkpointing for Parallel Applications in Cluster Federations", Proc. 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, Chicago, IL, USA, pp. 773-782, April 2004.
[19] J. Cao, Y. Chen, K. Zhang and Y. He, "Checkpointing in Hybrid Distributed Systems", Proc. 7th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN'04), pp. 136-141, May 2004.
Discovering Hidden Knowledge from Biomedical Literature
Ingrid Petrič1, Tanja Urbančič1,2 and Bojan Cestnik2,3
1University of Nova Gorica, Vipavska 13, 5000 Nova Gorica, Slovenia
2Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
3Temida, d.o.o., Dunajska 51, 1000 Ljubljana, Slovenia
E-mail: ingrid.petric@p-ng.si, tanja.urbancic@p-ng.si, bojan.cestnik@temida.si
Keywords: text mining, ontology construction, autism
Received: June 3, 2006
In this paper we investigate the potential of text mining for discovering implicit knowledge in biomedical literature. Based on Swanson's suggestion for hypothesis generation, we tried to identify potential contributions to a better understanding of autism, focusing on articles from the database PubMed Central. First, we used them for ontology construction in order to obtain an improved insight into the domain structure.
Next, we extracted a few rare terms that could potentially lead to new knowledge discovery for the explanation of the autism phenomena. We present a concrete example of such constructed knowledge about the substance calcineurin and its potential relations with other already published indications of autism. Povzetek: Prispevek opisuje uporabo metod rudarjenja besedil na medicinskih člankih s področja avtizma.
1 Introduction
The practice of biomedicine, like other activities of our society, is inherently an information-management task (Shortliffe, 1993). The Internet, a very common and increasingly used information source, provides massive heterogeneous collections of data. Huge bibliographic databases thus often contain interesting information that may be inexplicit or even hidden. One such database is MEDLINE, the primary component of PubMed, which is the United States National Library of Medicine's bibliographic database. It covers over 4,800 journals published in more than 70 countries worldwide and thus contains over 14 million citations from 1966 to the present (PubMed, 2006). The daily increasing number of biomedical articles provides a huge potential source of new data. In the MEDLINE database, between 1,500 and 3,500 complete references have been added each day from Tuesday to Saturday since 2002 (PubMed, 2006). There is an urgent need to assist researchers in extracting knowledge from the rapidly growing volumes of databases in order to improve the usefulness of these vast amounts of data. For such reasons, the ability to extract the right information of interest remains the subject of the growing field of knowledge discovery in databases. Knowledge discovery is the process of discovering useful knowledge from data, which includes data mining as the application of specific algorithms for extracting patterns from data (Fayyad et al., 1996). In fact, important information hidden in huge databases could be discovered by data mining and knowledge discovery techniques. More specifically, those databases that contain bibliographic semi-structured data can be approached by text mining as a specific kind of data mining. Although the technology for data and text mining is well advanced, its potential still seems to lack sufficient recognition. Healthcare in general is one of the slowest sectors in utilizing information and communication technologies to their full benefit; however, the need for computer literacy has already been recognised and acknowledged by professionals in this sector (Štepankova, Engova, 2006). Therefore, one of the major challenges of biomedical text mining over the next 5 to 10 years is to make these techniques better understood and more useful to biomedical researchers (Cohen, Hersh, 2005). At the same time, continued cooperation with professional communities such as the biomedical research community is required to ensure that their needs are properly addressed. Such collaboration is particularly crucial in complex scientific areas, such as the autism field of biomedical research. The specific requirements in autism research, as presented by Zerhouni (2004), actually emphasise the need for increasing the efficiency of communication of research findings to the related science community. Autism belongs to a group of pervasive developmental disorders that in most cases have an unclear origin.
The main characteristic components of abnormal functioning in autism are the early delay and abnormal development of cognitive, communication and social interaction skills of affected individuals. In the fourth, revised edition of the Diagnostic and Statistical Manual of Mental Disorders, the category of pervasive developmental disorders refers to a group of symptoms of neurological development, connected with early brain mechanisms that to a large extent condition social abilities already in childhood (American Psychiatric Association, 2000). Such heterogeneous features of autistic developmental disturbance and the different degrees to which it affects children have led to the contemporary naming of autism conditions with the term autism spectrum disorders, abbreviated ASD. The lack of studies, evidenced by Zerhouni (2004), that would increase the knowledge about risk factors and the early development of autism, and that would better define the characterization of autism spectrum disorders, has led us to choose autism as the application domain of our research in knowledge technologies. In this article we focus on the areas and methods where text mining potentially enriches biomedical science and thus connects information technologies with biomedical expert knowledge in an interdisciplinary way. First we describe several text mining approaches used in real biomedical settings for extracting knowledge from data. Then we present our approach towards the integration of real problem analysis and the extraction of potentially useful information from data. Our main aim was to extract some implicit and previously unknown interesting information from professional articles about autism. Some of our text mining results are finally described with example pairs of implicit connections that we managed to identify from biomedical articles.
2 Text mining in biomedicine
There are several biomedical examples where data mining has been successfully applied, as described in a review by Van Someren and Urbančič (2006). Examples include diagnosis, where data mining relates symptoms and other attributes of patients to their disease, subgroups of patients that are at risk for a certain disease, and gene expression, with a growing number of applications where predictions and identifications of disease markers are made based on features of genes. While data mining usually operates with collections of well-structured data, researchers often have to deal with semi-structured text collections, too. Such datasets require the use of text mining techniques. Extracting important information from the increasingly available biomedical knowledge represented in digital text form has proved to be an important opportunity for biomedical discoveries and hypothesis generation. Having access to and the ability to work with the newest information indeed means great potential for experts, who can benefit from the advantages of information systems and technologies. Biomedical informatics thus presents an essential element of the biomedical research process. Methods that have been recently used for biomedical text mining tasks include the following (Cohen, Hersh, 2005):
• Named entity recognition, in order to identify all of the instances of a name for a specific type of domain within a collection of text. Examples of recent areas of biomedical research: drug names within published journal articles; gene names and their symbols within a collection of MEDLINE abstracts.
  Text mining approaches: lexicon-based, rules-based, statistically based, combined.
• Text classification, with the goal to automatically determine whether a document or a part of it has particular attributes of interest. Examples of recent areas of biomedical research: documents discussing a given topic; texts containing a certain type of information. Text mining approaches: classification rule induction.
• Synonym and abbreviation extraction, with the attempt to speed up literature search with automatic collections of synonyms and abbreviations for entities. Examples of recent areas of biomedical research: gene name synonyms; biomedical term abbreviations. Text mining approaches: combination of a named entity recognition system with statistical, support vector machine classifier-based, and automatic or manual pattern-based matching rule algorithms.
• Relationship extraction, with the goal to recognize occurrences of a pre-specified type of relationship between a pair of entities of specific types. Examples of recent areas of biomedical research: relationships between genes and proteins; text-based gene clustering. Text mining approaches: neighbour divergence analysis, vector space approach and k-medoids clustering algorithm, fuzzy set theory on co-occurring dataset records, type and part-of-speech tagging.
• Integration frameworks, with the intention to address many different user needs. Examples of recent areas of biomedical research: comparison of gene names and functional terms; gene-based text clusters. Text mining approaches: template-based, text profiling and clustering based.
• Hypothesis generation, which focuses on the uncovering of implicit relationships, worthy of further investigation, that are inferred by the presence of other more explicit information. Examples of recent areas of biomedical research: connection between patient benefit and food substances; potential new uses and therapeutic effects of drugs. Text mining approaches: Swanson's ABC model-based.
In the continuation we concentrate on hypothesis generation as the central point of our research interest.
3 Related work
The machine learning process is characterized by the search space, which reflects the expression of the hypothesis language as a target knowledge (Botta et al., 2003). The idea of the text mining approach towards hypothesis generation, known as Swanson's ABC model, is to discover complementary structures in disjoint journal articles. This model assumes that when one literature reports that agent A causes phenomenon B, and a second literature reports that B influences C, we could propose that agent A might influence phenomenon C (Swanson, 1990). To find some published evidence leading to undiscovered knowledge, the A and C literatures should have few or no published articles in common. In this way, Swanson discovered, among others, several relationships connecting migraine and decreased levels of magnesium (Swanson, 1990). To facilitate the discovery of hypotheses by linking findings across literature, Swanson and his colleagues designed a set of interactive software that is available on a web-based system called Arrowsmith (Smalheiser, Swanson, 1998). Pratt and Yetisgen-Yildiz (2003) designed LitLinker, which uses data mining techniques to identify correlations among concepts and then uses those correlations to discover potential causal links between biomedical terms. Sehgal et al.
presented a system that may be used to explore topics and their relationships using text collections such as MEDLINE (Sehgal et al., 2003). Weeber et al. experimented with Swanson's idea of searching the literature to generate new potential therapeutic uses of the drug thalidomide, using the concept-based discovery support system DAD on the scientific literature (Weeber et al., 2003). Another example of discovering new relations from a bibliographic database according to Swanson's model is the identification of disease candidate genes by Bitola, an interactive discovery support system for biomedicine (Hristovski et al., 2005). Transitive text mining was also explored by Grohmann and Stegmann (2005), who developed a web-based tool, C-MLink. For successful data mining, wide background knowledge concerning the problem domain is a substantial advantage. In fact, hypothesis generation from text mining results relies on background knowledge, experience, and intuition (Srinivasan, 2004). With this consideration we started our examination of the autism phenomena with the identification of its main concepts and a review of what is already known about autism. We identified such information by ontology construction, which we found to be a very fast and effective way of exploring large datasets. Ontologies in general, with their capability to share a common understanding of domains, support researchers with the ability to reason over and analyse the information at issue (Joshi, Undercoffer, 2004). Many tools that help construct ontologies from texts have been developed and successfully used in practice (Brank et al., 2005). Among them, OntoGen (Fortuna et al., 2006), an interactive tool for semi-automatic construction of ontologies, has received remarkable attention.
4 Identification of domain structure
An important goal in our recognition of the autism phenomena was to uncover the fundamental concepts that provide the common knowledge about autism. To identify useful pieces of knowledge in the large amount of digital articles, one approach would be to read and manually analyse all available data. Since this is evidently a time-consuming task, we instead chose to direct our attention only to the most relevant information about the domain of interest. We performed our research with the computational support of OntoGen.
4.1 Target dataset
We decided to analyse the professional literature about autism that is publicly accessible on the World Wide Web in the database of biomedical publications, PubMed. In the PubMed database we found 10,821 documents (as of August 21, 2006) that contain derived forms of autis*, the expression root for autism. There were 354 articles with their entire text published in the PubMed Central database. Other relevant publications were either restricted to abstracts of documents or their entire texts were published in sources outside PubMed. From the listed 354 articles we further restricted the target set to those published in the last ten years. As a result, we got 214 articles from 1997 forward, which we decomposed into titles, abstracts and texts for the purpose of further analysis.
4.2 Text mining support system
One of the most frequently used text representations in text mining is the word-vector representation, where the word-vector contains a weight for each word of the text, proportional to the number of its occurrences in the text (Mladenić, 2006).
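To make the word-vector representation concrete, the following is a minimal sketch of turning a small document collection into term-frequency vectors. It is only an illustration of the general idea described above; the exact weighting and tokenization used by OntoGen are not reproduced here, and the toy documents are hypothetical.

```python
from collections import Counter

def word_vectors(documents, stop_words=frozenset()):
    """Build simple term-frequency word-vectors for a list of documents.
    Each vector holds, for every index term, the number of occurrences of
    that term in the document (a basic variant of the weighting above)."""
    tokenized = [[w.lower() for w in doc.split() if w.lower() not in stop_words]
                 for doc in documents]
    # index terms: all words appearing in the collection
    terms = sorted({w for doc in tokenized for w in doc})
    vectors = [[Counter(doc)[t] for t in terms] for doc in tokenized]
    return terms, vectors

# usage on two toy "abstracts"
terms, vectors = word_vectors(["autism gene expression study",
                               "gene expression in brain tissue"],
                              stop_words=frozenset({"in"}))
print(terms)    # index terms of the toy collection
print(vectors)  # one term-frequency vector per document
```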
Such representations are also used by OntoGen, which enables interactive construction of ontologies in a chosen domain. We used it to construct several autism ontologies. The input for the tool is a collection of text documents. With machine learning techniques OntoGen supports important phases of ontology construction by suggesting concepts and their names, by defining relations between them, and by automatic assignment of documents to the concepts (Fortuna, 2005).
4.3 Ontology of autism domain
Our aim was first to review the autism literature and to identify the most frequent topics researched in this domain. With this intention we built the autism domain ontology with OntoGen on 214 articles from the PubMed Central database that treat problems of autism. OntoGen displayed sub-concepts of the autism domain as suggested by its clustering algorithm, and described them with their main keywords extracted from the text documents. The keywords that we used for concept description were calculated both according to the concept centroid vector and by a Support Vector Machine based linear model. The system also displayed the current coverage of each concept by the number of documents that it positively classified into the concept and the inner-cluster similarity measures. Ontologies built with OntoGen, such as the example shown in Figure 1, helped us to substantially speed up the process of reviewing and understanding the complex and heterogeneous spectrum of scientific articles about autism.
Figure 1: Concepts of autism ontology with 7 subgroups, built on 214 abstracts from the PubMed Central database.
The main concepts of the autism phenomena as they result from the first level of our ontology model (first-level subgroups of the autism domain) are: genetics; teaching and training; reinforcers and stimulus; sensory and auditory response; stereotypy and behavioural problems; language disorders; and MMR (Measles, Mumps, and Rubella) vaccine. An important confirmation of the resulting ontology is the recent state of autism research as described by Zerhouni (2004), which summarizes the main scientific activities of autism research in the major areas of epidemiology, genetics, neurobiology, environmental factors and specific treatments of autism.
5 Extraction of implicit relationships from autism data
Besides constructing an ontology on the input file of texts, OntoGen also creates a *.txt.stat file with the statistical incidence of terms as they appear in the documents of the input dataset. We utilized this byproduct of OntoGen as the basis for our approach toward the identification of rare relations between autism data. As our goal was to discover undocumented knowledge about the autism phenomena, we assumed that by starting our search from rare connections between data rather than frequent ones, we would have better chances of discovering implicit relations that are still unknown and might nevertheless be useful for autism researchers.
5.1 The related concepts
Our approach towards discovering knowledge about autism concentrated on identifying interesting concepts within autism sub-areas of interest. Therefore, we considered the subdivision of the autism domain into research fields; moreover, we particularly directed our attention to the neurobiological basis of autistic abnormalities.
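As a concrete illustration of the term-selection step described above, the following minimal sketch filters candidate "rare" terms from term statistics. The format of OntoGen's *.txt.stat file is not specified here, so the sketch only assumes that, for each term, the number of input documents in which it occurs is known; the statistics values shown are hypothetical.

```python
def rare_terms(term_doc_counts, max_doc_freq=1):
    """Select candidate rare terms from term statistics.
    term_doc_counts maps each term to the number of documents of the
    input dataset in which it occurs (the kind of incidence information
    the statistics file is assumed to provide)."""
    return sorted(t for t, df in term_doc_counts.items() if df <= max_doc_freq)

# hypothetical incidence statistics for a handful of terms
stats = {"gene": 87, "behaviour": 45, "synaptophysin": 1,
         "lactoylglutathione": 1, "calcium_channels": 1}
print(rare_terms(stats))  # ['calcium_channels', 'lactoylglutathione', 'synaptophysin']
```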
To find some related concepts, which would lead us to potential discoveries of new knowledge, we took the *.txt.stat file created by OntoGen while constructing the ontologies. We first focused our attention on those terms listed in this text file that appeared in only one article from the input dataset. Taking into account also background knowledge about autism, we chose words that could be useful for autism discovery. Three of the chosen terms, presented also in the intersection area in Figure 2, are: lactoylglutathione, synaptophysin and calcium_channels. There are three major reasons for these choices. First, we found that an increase in polarity of glyoxalase I in autism brains was reported and that the glyoxalase system also involves lactoylglutathione. Second, as altered synaptic function was also discussed in autism articles, we took into consideration synaptophysin, a protein localized to synaptic vesicles. And third, abnormal calcium signalling was found in some autistic children, so we also chose the term calcium_channels for further discovery. After selecting these three terms of interest, we searched the article database to find what all these terms have in common. One of the goals of text mining is to automatically discover interesting hypotheses from a potentially useful text collection (Srinivasan, 2004). By text mining PubMed articles that treat the domains of these selected terms, we constructed their ontologies, and from OntoGen's *.txt.stat files we retrieved the words they all have in common (the words that appeared in all three *.txt.stat files). One such term, also listed in Figure 2, that could be interesting for hypothesis generation and further research on the autism phenomena, is calcineurin. Calcineurin is a calcium- and calmodulin-dependent serine/threonine protein phosphatase, which is widely present in mammalian tissues, with the highest levels found in the brain (Rusnak, Mertz, 2000). Our literature mining in disjoint journal articles showed that it could be related to autistic disorders; however, to the present, no direct evidence of a role of calcineurin in autism has been reported.
Fatemi et al. (2001) reported a reduction of Bcl-2 (a regulatory protein for control of programmed brain cell death) levels in the autistic cerebellum. Erin et al. (2003) observed that calcineurin occurred as a complex with Bcl-2 in various regions of rat and mouse brain.
Qiu et al. (2006) described the low-density lipoprotein receptors that regulate cholesterol transport in neuropsychiatric disorders, such as autism. Cofan et al. (2005) published their article about the effect of calcineurin inhibitors on low-density lipoprotein oxidation.
Bear et al. (2004) reported the loss of fragile X protein, an identified cause of autism, that increased long-term depression in mouse hippocampus. Zhabotinsky et al. (2006) described induction of long-term depression that depends on calcineurin.
Figure 2: Results of our approach to literature mining on the autism domain (autism articles from PubMed and the calcineurin literature, linked through the lactoylglutathione, synaptophysin and calcium_channels *.txt.stat files; common terms include calcineurin, cysteine, magnesium, melanoma and oxidative stress).
5.2 The explored conjectures
In order to justify the role of calcineurin in the autistic problem domain we decided to search and explore possible reasoning paths that relate the selected substance to some known expressions of autism.
Since the direct relation was not yet noted in the literature, our goal was to find a few plausible interconnecting terms that relate the two notions (Swanson, 1990). Having this in mind, we explored the union of PubMed articles about autism and articles about calcineurin. By building ontologies on such an input dataset of combined articles, the goal was to discover documents having as many words as possible in common. For this purpose we searched for the highest similarity measures inside the clusters of the ontologies. Interestingly, by this search we identified several pairs of PubMed articles that connect the two categories of biomedical literature, autism and calcineurin. In this way we were able to find eleven pairs of articles, which, when put together, could be seen as arguments for new hypotheses of an autism-calcineurin relationship, such as the three listed in Table 1. When we showed the presented results to an expert on autistic spectrum and related disorders, she not only confirmed strong interest in the method and in the discovered relations, but was also able to guide our further work very efficiently by turning our attention to discovering the relationship between autism and the fragile X chromosome.
Table 1: Hypotheses for the calcineurin and autism relationship.
6 Conclusion
Our study confirms the potential of ontology construction by OntoGen on biomedical literature to systematically structure the main concepts. The evaluation of the ontology constructed on autism showed important similarity to the reported state of autism research. Considering OntoGen's statistical data can lead to the discovery of potentially useful and previously unknown information related to the researched phenomena. In this way, OntoGen's functionality can be extended to retrieve new information from vast amounts of textual data that experts otherwise have to explore manually. As terms connecting the sets of literature about synaptophysin, lactoylglutathione and calcium channels, which were selected as three interesting rare terms from the autism articles, we found calcineurin, cysteine, magnesium, melanoma, oxidative stress and many others. In the preliminary expert evaluation the approach proposed in this paper proved to be successful. However, further assessment of the possible role of calcineurin and other resulting candidates in autism is needed to justify our methodological approach and to see if it can contribute to the knowledge corpus of the autism phenomena.
Acknowledgement
This work was partially supported by the Slovenian Research Agency programme Knowledge Technologies (2004-2008). We thank Nada Lavrač for her suggestion to use OntoGen and Blaž Fortuna for his discussions about OntoGen's performance. We also appreciate the help and support we got from Marta Macedoni-Lukšič in our efforts to better understand autism.
References
[1] American Psychiatric Association (2000) Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision. Washington, DC.
[2] Bear M.F., Huber K.M., Warren S.T. (2004) The mGluR theory of fragile X mental retardation, Trends in Neurosciences, 27(7), pp. 370-377.
[3] Botta M., Saitta L., Sebag M. (2003) Relational Learning as Search in a Critical Region, Journal of Machine Learning Research, 4, pp. 431-463.
[4] Brank J., Grobelnik M., Mladenić D. (2005) A survey of ontology evaluation techniques, SIKDD 2005 at multiconference IS 2005, Ljubljana, Slovenia.
[5] Cofan F., Cofan M., Campos B., Guerra R., Campistol J.M., Oppenheimer F.
(2005) Effect of calcineurin inhibitors on low-density lipoprotein oxidation, Transplantation Proceedings, 37(9), pp. 3791-3793.
[6] Cohen A.M., Hersh W.R. (2005) A Survey of Current Work in Biomedical Text Mining, Briefings in Bioinformatics, 6(1), pp. 57-71.
[7] Erin N., Bronson S.K., Billingsley M.L. (2003) Calcium-dependent interaction of calcineurin with Bcl-2 in neuronal tissue, Neuroscience, 117(3), pp. 541-555.
[8] Fatemi S.H., Stary J.M., Halt A.R., Realmuto G.R. (2001) Dysregulation of Reelin and Bcl-2 proteins in autistic cerebellum, Journal of Autism and Developmental Disorders, 31(6), pp. 529-535.
[9] Fayyad U., Piatetsky-Shapiro G., Smyth P. (1996) Knowledge Discovery and Data Mining: Towards a Unifying Framework. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon.
[10] Fortuna B. (2006) [http://ontogen.ijs.si/index.html], OntoGen: Description.
[11] Fortuna B., Grobelnik M., Mladenić D. (2006) System for semi-automatic ontology construction. Demo at ESWC 2006, Budva, Montenegro.
[12] Grohmann G., Stegmann J. (2005) C-MLink: a web-based tool for transitive text mining, Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics, Stockholm, Sweden, pp. 658-659.
[13] Hristovski D., Peterlin B., Mitchell J.A., Humphrey S.M. (2005) Using literature-based discovery to identify disease candidate genes, International Journal of Medical Informatics, 74, pp. 289-298.
[14] Joshi A., Undercoffer J.L. (2004) On Data Mining, Semantics, and Intrusion Detection. What to Dig for and Where to Find It, Data Mining: Next Generation Challenges and Future Directions, Menlo Park, California, pp. 437-460.
[15] Mladenić D. (2006) Text Mining: Machine Learning on Documents, Encyclopedia of Data Warehousing and Mining, Hershey: Idea Group Reference, pp. 1109-1112.
[16] Pratt W., Yetisgen-Yildiz M. (2003) LitLinker: Capturing Connections across the Biomedical Literature, Proceedings of the International Conference on Knowledge Capture (K-Cap'03), Florida, pp. 105-112.
[17] PubMed (2006) [http://www.ncbi.nlm.nih.gov/], Overview.
[18] Qiu S., Korwek K.M., Weeber E.J. (2006) A fresh look at an ancient receptor family: emerging roles for density lipoprotein receptors in synaptic plasticity and memory formation, Neurobiology of Learning and Memory, 85(1), pp. 16-29.
[19] Rusnak F., Mertz P. (2000) Calcineurin: Form and Function, Physiological Reviews, 80(4), pp. 1483-1521.
[20] Sehgal A., Qiu X.Y., Srinivasan P. (2003) Mining MEDLINE Metadata to Explore Genes and their Connections, Proceedings of the SIGIR 2003 Workshop on Text Analysis and Search for Bioinformatics.
[21] Shortliffe E.H. (1993) The Adolescence of AI in Medicine: Will the Field Come of Age in the '90s? Artificial Intelligence in Medicine, 5(2), pp. 93-106.
[22] Smalheiser N.R., Swanson D.R. (1998) Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses, Computer Methods and Programs in Biomedicine, 57, pp. 149-153.
[23] Srinivasan P. (2004) Text mining: Generating hypotheses from MEDLINE, Journal of the American Society for Information Science and Technology, 55(5), pp. 396-413.
[24] Swanson D.R. (1990) Medical literature as a potential source of new knowledge, Bulletin of the Medical Library Association, 78(1), pp. 29-37.
[25] Štepankova O., Engova D. (2006) Professional Competence and Computer Literacy in e-age, Focus on Healthcare, Methods of Information in Medicine, 45, pp. 300-305.
[26] Van Someren M., Urbančič T. (2006) Applications of machine learning: matching problems to tasks and methods, The Knowledge Engineering Review, 20(4), pp. 363-402.
[27] Weeber M., Vos R., Klein H., De Jong-van den Berg L.T., Aronson A.R., Molema G. (2003) Generating Hypotheses by Discovering Implicit Associations in the Literature: A Case Report of a Search for New Potential Therapeutic Uses for Thalidomide, Journal of the American Medical Informatics Association, 10(3), pp. 252-259.
[28] Zerhouni E.A. for National Institutes of Health and National Institute of Mental Health (2004) Congressional Appropriations Committee Report on the State of Autism Research. Department of Health and Human Services, Bethesda, Maryland.
[29] Zhabotinsky A.M., Camp R.N., Epstein I.R., Lisman J.E. (2006) Role of the neurogranin concentrated in spines in the induction of long-term potentiation, Journal of Neuroscience, 26(28), pp. 7337-7347.
Approximate Representation of Textual Documents in the Concept Space
Jasminka Dobša
University of Zagreb, Faculty of Organization and Informatics, Pavlinska 2, 42 000 Varaždin, Croatia
jasminka.dobsa@foi.hr
Bojana Dalbelo Bašić
University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10 000 Zagreb, Croatia
Bojana.Dalbelo@fer.hr
Keywords: dimensionality reduction, concept decomposition, information retrieval
Received: November 17, 2006
In this paper we deal with the problem of the addition of new documents to a collection when documents are represented in a lower dimensional space by concept indexing. Concept indexing (CI) is a method of feature construction that relies on concept decomposition of the term-document matrix. By using CI, the original representations of documents are projected onto the space spanned by the centroids of clusters, which are called concept vectors. This problem is especially interesting for applications on the World Wide Web. The proposed methods are tested on the task of information retrieval. The vectors on which the projection is done in the process of dimension reduction are constructed on the basis of the representations of all documents in the collection, and computation of the new representations in the space of reduced dimension demands recomputation of the concept decomposition. The solution to this problem is the development of methods which give approximate representations of newly added documents in the space of reduced dimension. In this paper, two methods are introduced for the addition of new documents in the space of reduced dimension. In the first method there is no addition of new index terms and the added documents are represented by the existing list of index terms, while in the second method the list of index terms is extended and the representations of documents and concept vectors are extended in the dimensions of the newly added terms. It is shown that the representation of documents by the extended list of index terms does not significantly improve the performance of information retrieval. Povzetek: Predstavljeni sta dve metodi konceptualnega indeksiranja dokumentov.
1 Introduction
In this paper we deal with the problem of the addition of new documents to a collection when documents are represented in a lower dimensional space by concept indexing. This problem is especially interesting for applications on the World Wide Web. The proposed methods are tested on the task of information retrieval [1].
There are many motives for dimension reduction in the vector space model: a decrease of the memory space needed for the representation of documents, faster performance of information retrieval or automatic classification of documents, and reduction of the noise and redundancy present in the representation of documents. Methods for dimension reduction in the vector space model based on the extraction of new parameters for the representation of documents (feature construction) tend to overcome the problem of synonyms and polysemies, which are two major obstacles in information retrieval. A disadvantage of feature construction may be the uninterpretability of the newly obtained parameters or features. Our investigation is based on the method of feature construction called concept indexing, which was introduced in 2001 by Dhillon and Modha [7]. This method uses centroids of clusters created by the spherical k-means algorithm, or so-called concept decomposition (CD), for lowering the rank of the term-document matrix. By using CI, the original representations of documents are projected onto the space spanned by the centroids of clusters, which we here call concept vectors. Representation of a new document in the vector space model is trivial. The problem appears when we want to add new documents to the space of reduced dimension. Namely, the vectors on which the projection is done in the process of dimension reduction are constructed on the basis of the representations of all documents in the collection, and computation of the new representations in the space of reduced dimension demands recomputation of the concept decomposition. The solution to this problem is the development of methods which give approximate representations of newly added documents in the space of reduced dimension. Application of such methods will delay the process of recomputation of the concept decomposition. Methods for the addition of representations of new documents in the space of reduced dimension have already been developed for the LSI method [3, 9]. The LSI method was introduced in 1990 [4] and improved in 1995 [3]. Since then LSI has been a benchmark in the field of dimension reduction. Although the LSI method has had empirical success, it suffers from the lack of interpretation of the newly obtained features, which causes a lack of control for accomplishing specific tasks in information retrieval. Kolda and O'Leary [8] developed a method for the addition of representations of new documents for the LSI method that uses semi-discrete decomposition, which saves memory space. When the collection of documents is extended, it seems natural to also extend the list of index terms with terms present in the added documents that were not present in the starting collection of documents, or were present very rarely and were therefore not included in the list of index terms. In this paper, two methods are introduced for the addition of new documents in the space spanned by the concept vectors, which is called the concept space. In the first method there is no addition of new index terms and the added documents are represented by the existing list of index terms, while in the second method the list of index terms is extended and the representations of documents and concept vectors are extended in the dimensions of the newly added terms. This paper is organized as follows. Section 2 provides a description of the technique of dimensionality reduction by concept decomposition. In Section 3, novel algorithms for approximate addition of documents in the concept space are proposed.
Section 4 provides an example, while Section 5 describes the experiment in which the proposed algorithms are tested. The last section gives conclusions and directions for further work.
2 Dimensionality reduction by the concept decomposition
Let the $m \times n$ matrix $A = [a_{ij}]$ be the term-document matrix. Then $a_{ij}$ is the weight of the i-th term in the j-th document. A query has the same form as a document; it is a vector whose i-th component is the weight of the i-th term in the query. A common measure of similarity between the query and the document is the cosine of the angle between them. Techniques of feature construction enable mapping of documents' representations, which are similar in their content or contain many index terms in common, to new representations in the space of reduced dimension that are closer than their representations in the original vector space. This enables the retrieval of documents which are relevant for the query but do not contain the index terms contained in the vector representation of the query. In this section we describe the algorithm for computation of the concept decomposition by the fuzzy k-means algorithm [5].
2.1 Fuzzy k-means algorithm
The fuzzy k-means algorithm (FKM) [10] generalizes the hard k-means algorithm. The goal of the k-means algorithm is to cluster n objects (here documents) into k clusters and find k mean vectors or centroids of the clusters. Here we will call these mean vectors concept vectors, because that is what they represent. As opposed to the hard k-means algorithm, which allows a document to belong to only one cluster, FKM allows a document to partially belong to multiple clusters. FKM seeks a minimum of a heuristic global cost function
$$J_{fuzz} = \sum_{i=1}^{k} \sum_{j=1}^{n} \mu_{ij}^{b} \, \| a_j - c_i \|^2$$
where $a_j$, $j = 1, \ldots, n$, are the vectors of documents, $c_i$, $i = 1, \ldots, k$, are the concept vectors, $\mu_{ij}$ is the fuzzy membership degree of document $a_j$ in the cluster whose concept is $c_i$, and $b$ is a weight exponent of the fuzzy membership. In general, the $J_{fuzz}$ criterion is minimized when concept $c_i$ is close to those documents that have a high fuzzy membership degree for cluster $i$, $i = 1, \ldots, k$. By solving the system of equations $\partial J_{fuzz} / \partial c_i = 0$ and $\partial J_{fuzz} / \partial \mu_{ij} = 0$, we obtain a stationary point for which the fuzzy membership degrees are given by
$$\mu_{ij} = \frac{1}{\sum_{r=1}^{k} \left( \frac{\| a_j - c_i \|}{\| a_j - c_r \|} \right)^{\frac{2}{b-1}}} \qquad (1)$$
for $i = 1, \ldots, k$ and $j = 1, \ldots, n$, while the centroids or concept vectors are given by
$$c_i = \frac{\sum_{j=1}^{n} \mu_{ij}^{b} \, a_j}{\sum_{j=1}^{n} \mu_{ij}^{b}} \qquad (2)$$
for $i = 1, \ldots, k$. For such a stationary point the cost function reaches a local minimum. We obtain the concept vectors by starting with arbitrary initial concept vectors $c_i^{(0)}$, $i = 1, \ldots, k$, and by iteratively computing the fuzzy membership degrees $\mu_{ij}^{(t)}$, the cost function $J_{fuzz}^{(t)}$ and new concept vectors $c_i^{(t+1)}$, where $t$ is the index of the iteration, until $|J_{fuzz}^{(t+1)} - J_{fuzz}^{(t)}| < \varepsilon$ for some threshold $\varepsilon$.
2.2 Concept decomposition
Our target is to approximate each document vector by a linear combination of concept vectors. The concept matrix is an $m \times k$ matrix whose i-th column is the concept vector $c_i$, that is $C_k = [c_1, \ldots, c_k]$. If we assume linear independence of the concept vectors, it follows that the concept matrix has rank k. Now we define the concept decomposition $P_k$ of the term-document matrix $A$ as the least-squares approximation of $A$ on the column space of the concept matrix $C_k$. The concept decomposition is an $m \times n$ matrix $P_k = C_k Z^*$, where $Z^*$ is the solution of the least-squares problem, i.e. $Z^* = (C_k^T C_k)^{-1} C_k^T A$. $Z^*$ is a matrix of type $k \times n$ and its columns are the representations of the documents in the concept space.
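A minimal numpy sketch of the two steps just described is given below: the fuzzy k-means iteration following (1) and (2), and the least-squares computation of $Z^*$. It uses plain Euclidean distances as in the formulas above; the parameter values, the initialization and the toy matrix are assumptions made only for illustration.

```python
import numpy as np

def fuzzy_kmeans(A, k, b=1.5, eps=1e-6, max_iter=100, seed=0):
    """Fuzzy k-means on the columns of the term-document matrix A (m x n).
    Returns the m x k concept matrix C whose columns are the concept vectors.
    Membership update follows (1); the centroid update follows (2)."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    C = A[:, rng.choice(n, k, replace=False)].astype(float)  # initial concept vectors
    prev_cost = np.inf
    for _ in range(max_iter):
        # d[i, j] = ||a_j - c_i||, distance of every document to every concept
        d = np.linalg.norm(A[:, None, :] - C[:, :, None], axis=0)       # k x n
        d = np.maximum(d, 1e-12)
        # fuzzy membership degrees, equation (1)
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (b - 1.0))    # k x k x n
        mu = 1.0 / ratio.sum(axis=1)                                    # k x n
        cost = np.sum((mu ** b) * d ** 2)                               # J_fuzz
        # concept vectors, equation (2)
        w = mu ** b
        C = (A @ w.T) / w.sum(axis=1)
        if abs(prev_cost - cost) < eps:
            break
        prev_cost = cost
    return C

def concept_decomposition(A, C):
    """Least-squares representation Z* of documents in the concept space,
    i.e. the solution of min ||A - C Z||_F, Z* = (C^T C)^{-1} C^T A."""
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    return Z

# usage on a random toy matrix (16 terms, 15 documents, k = 2 concepts)
A = np.abs(np.random.default_rng(1).random((16, 15)))
C = fuzzy_kmeans(A, k=2)
Z = concept_decomposition(A, C)
print(C.shape, Z.shape)   # (16, 2) (2, 15)
```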
Similarly, the representation of a query $q$ in the reduced dimension space is given by $(C_k^T C_k)^{-1} C_k^T q$, and the similarity between a document and the query is given by the cosine of the angle between them. Concept indexing is a technique of indexing text documents by using concept decomposition.
3 Addition of representations of new documents in the concept space
In this section novel algorithms for the addition of text documents' representations in the concept space are proposed. The goal is to add new documents to a collection represented in the reduced dimension space, and this goal is achieved with and without an extension of the list of index terms. Let us introduce the matrix notation that will be used in this section. The matrix
$$A = \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix} \qquad (3)$$
will be the extended term-document matrix, where $A_1$ is the matrix of starting documents in the space of starting terms, $A_3$ is the matrix of starting documents in the space of added terms, $A_2$ is the matrix of added documents in the space of starting terms and $A_4$ is the matrix of added documents in the space of added terms. Further, let $m_1$ be the number of starting terms, $m_2$ the number of added terms, $n_1$ the number of starting documents and $n_2$ the number of added documents. Here we introduce two methods of approximate addition of new documents in the concept space: (a) projection of new documents on the existing concept vectors (Method A) and (b) projection of new documents on the existing concept vectors extended in the dimensions of the newly added terms (Method B). Assume that the documents of the starting matrix $A_1$ are clustered by the fuzzy k-means algorithm and the centroids of the clusters are computed. Let $C_1$ be the concept matrix whose columns are the concept vectors and let $C_2$ be the matrix consisting of the extensions of the concept vectors in the dimensions of the added terms. The concept vectors of the matrix $C_1$ are calculated by formula (2) using the columns of matrix $A_1$ as document representations, while the extensions of the concept vectors are calculated by the same formula using the respective columns of matrix $A_3$ as representations of the starting documents in the space of added terms. Let the extensions of the concept vectors form the extension of the concept matrix, denoted by $C_2$. Then $C = \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}$ is the concept matrix whose columns are the concept vectors extended in the dimensions of the newly added terms. Representations of the documents in the concept space of the extended term-document matrix are given by the expression
$$\left( \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}^T \begin{bmatrix} C_1 \\ C_2 \end{bmatrix} \right)^{-1} \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}^T \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix} = (C_1^T C_1 + C_2^T C_2)^{-1} \begin{bmatrix} C_1^T & C_2^T \end{bmatrix} \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix} \approx \left[ \, (C_1^T C_1)^{-1} C_1^T A_1 + (C_1^T C_1)^{-1} C_2^T A_3 \; : \; (C_1^T C_1)^{-1} C_1^T A_2 + (C_1^T C_1)^{-1} C_2^T A_4 \, \right] \qquad (4)$$
where the four summands of the last line are referred to below as expressions (5), (6), (7) and (8), respectively. In the third line of expression (4) the approximation $(C_1^T C_1 + C_2^T C_2)^{-1} \approx (C_1^T C_1)^{-1}$ is assumed. Such an approximation is justified by the fact that the extensions of the concept vectors are sparser than the concept vectors formed from the starting documents, because the coordinates of the extended concept vectors are weights of added terms which were not included in the list of index terms before the addition of new documents. It was established by experiment that $\| C_2^T C_2 \| \ll \| C_1^T C_1 \|$. The number of operations is significantly reduced by this approximation, because the inverse $(C_1^T C_1)^{-1}$ is already computed during the computation of the starting documents' projection. This approximation is not necessary for the application of Method A, because this method does not use the extensions of the concept vectors.
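As an illustration of expression (4), the following minimal numpy sketch projects added documents with the two methods: Method A uses only the existing concept vectors $C_1$ and the old-term part $A_2$, while Method B additionally uses the concept-vector extensions $C_2$ and the new-term part $A_4$ under the approximation discussed above. The stressing coefficient introduced in formula (9) below is omitted, and the toy block sizes are assumptions.

```python
import numpy as np

def add_documents_method_a(C1, A2):
    """Method A: project added documents (old terms only) on the existing
    concept vectors, i.e. (C1^T C1)^{-1} C1^T A2."""
    return np.linalg.solve(C1.T @ C1, C1.T @ A2)

def add_documents_method_b(C1, C2, A2, A4):
    """Method B: project added documents on the extended concept vectors,
    using the approximation (C^T C) ~ (C1^T C1) from expression (4):
    (C1^T C1)^{-1} (C1^T A2 + C2^T A4)."""
    G_inv = np.linalg.inv(C1.T @ C1)   # reusable, as noted in the text
    return G_inv @ (C1.T @ A2) + G_inv @ (C2.T @ A4)

# toy sizes: m1 = 16 old terms, m2 = 5 new terms, k = 2 concepts, n2 = 4 new documents
rng = np.random.default_rng(2)
C1, C2 = rng.random((16, 2)), rng.random((5, 2))
A2, A4 = rng.random((16, 4)), rng.random((5, 4))
print(add_documents_method_a(C1, A2).shape)          # (2, 4)
print(add_documents_method_b(C1, C2, A2, A4).shape)  # (2, 4)
```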
Representations of the starting documents are given by expression (5), while representations of the added documents are given by expression (7). Pre-processing of the extended term-document matrix includes normalization of the columns of matrices $A_1$ (starting documents) and $A_2$ (added documents) to unit length. Let us now calculate the number of operations needed for the application of Method A. The representations of the starting documents are already known, and so is the matrix $(C_1^T C_1)^{-1} C_1^T$. That is why the number of operations is equivalent to the number of operations needed for the multiplication of the matrices $(C_1^T C_1)^{-1} C_1^T$ and $A_2$, which is $2 m_1 k n_2$. By Method B the added documents are projected on the space of extended concept vectors. The vector representations of the starting documents are already known and are given by (5), while the representations of the added documents are computed by the formula
$$(C_1^T C_1)^{-1} C_1^T A_2 + \alpha \, (C_1^T C_1)^{-1} C_2^T A_4 \qquad (9)$$
where the coefficient $\alpha > 1$ has the role of stressing the importance of the added terms and documents. Pre-processing of the extended term-document matrix includes normalization of its columns to unit length. Performance of Method B demands computation of the concept vectors' extensions and computation of the added documents' projections. Computation of the first summand in formula (9) demands $2 m_1 k n_2$ operations, while computation of the second summand demands $2 k^2 m_2 + 2 m_2 n_2 k$ operations, because the inverse $(C_1^T C_1)^{-1}$ is already calculated. Addition of the matrix elements of the first and second summand and multiplication by the scalar in formula (9) demands $2 n_2 k$ operations. Further, computation of the concept vectors' extensions by application of formula (2) demands $2 n_1 k m_2 + 2 n_1 k$ operations. Normalization of the columns of the extended term-document matrix and of the concept matrix is not included in the calculation of the number of operations, because it is a standard pre-processing operation included in every algorithm. That means that the application of Method B demands $N_B = 2 m_1 k n_2 + 2 k^2 m_2 + 2 m_2 n_2 k + 2 n_1 k m_2 + 2 n_1 k + 2 n_2 k = 2k (m_1 n_2 + k m_2 + m_2 n_2 + n_1 m_2 + n_1 + n_2)$ operations.
4 An example
By this example [6] it will be shown, in an illustrative way, how documents are projected by the CI method into the two-dimensional concept space. A collection of 19 documents (titles of books) will be used, where 15 documents form the collection of starting documents and 4 documents form the collection of added documents. The documents are categorized into three categories: documents from the field of data mining (DM documents), documents from the field of linear algebra (LA documents) and documents which combine these two fields (application of linear algebra to data mining). The documents with their categorization are listed in Table 1. A list of terms is formed from the words contained in at least two documents of the starting collection, after which words on the stop list are removed and variations of words are mapped onto the same characteristic form (e.g. the terms matrix and matrices are mapped onto the term matrix, or applications and applied are mapped onto application). As a result, a list of 16 terms is obtained, which we have divided into three parts: 8 terms from the field of data mining (text, mining, clustering, classification, retrieval, information, document, data), 5 terms from the field of linear algebra (linear, algebra, matrix, vector, space) and 3 neutral terms (analysis, application, algorithm). Then we created a term-document matrix from the starting collection of documents and normalized its columns to unit norm.
This is the term-document matrix of the starting documents in the space of starting terms, $A_1$. Then we applied CD (k=2) to that matrix. In the concept decomposition $C_2 Z^*$, the rows of the concept matrix $C_2$ are representations of the terms and the columns of $Z^*$ are representations of the documents of the starting collection. We also created two queries (underlined words are from the list of terms):
1) Q1: Data mining
2) Q2: Using linear algebra for data mining.
For Q1 all data mining documents are relevant, while for Q2 documents D6, D18 and D19 are relevant. Most of the DM documents do not contain the words data and mining. Such documents will not be recognized as relevant by the simple term-matching vector space method. Documents D6 and D19, which are relevant for Q2, do not contain any of the terms from the list that are contained in the query. The representation of the query q by concept indexing is given by $(C_k^T C_k)^{-1} C_k^T q$, and the representations of the added documents' collection are computed in the same way (application of Method A).
Number | Status (Starting/Added) | Categorization | Document
D1 | Starting | DM | Survey of text mining: clustering, classification, and retrieval
D2 | Starting | DM | Automatic text processing: the transformation analysis and retrieval of information by computer
D3 | Starting | LA | Elementary linear algebra: A matrix approach
D4 | Starting | LA | Matrix algebra and its applications in statistics and econometrics
D5 | Starting | DM | Effective databases for text and document management
D6 | Starting | Combination | Matrices, vector spaces, and information retrieval
D7 | Starting | LA | Matrix analysis and applied linear algebra
D8 | Starting | LA | Topological vector spaces and algebras
D9 | Starting | DM | Information retrieval: data structures and algorithms
D10 | Starting | LA | Vector spaces and algebras for chemistry and physics
D11 | Starting | DM | Classification, clustering and data analysis
D12 | Starting | DM | Clustering of large data sets
D13 | Starting | DM | Clustering algorithms
D14 | Starting | DM | Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15 | Starting | DM | Data mining and knowledge discovery
D16 | Added | DM | Concept decomposition of large sparse text data using clustering
D17 | Added | LA | A rank-one reduction formula and its applications to matrix factorizations
D18 | Added | Combination | Analysis of data matrices
D19 | Added | Combination | A semi-discrete matrix decomposition for latent semantic indexing in information retrieval
Table 1: Documents and their categorization (DM - data mining documents, LA - linear algebra documents). Documents D6, D18 and D19 are a combination of these two categories. Words from the list of terms are underlined.
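To illustrate how a query such as Q1 is matched in the concept space, the following minimal sketch projects a query with the same formula used for the documents and ranks documents by cosine similarity. The toy matrices and the hypothetical query term positions are assumptions; the actual weights of the example are not reproduced here.

```python
import numpy as np

def project(Ck, x):
    """Representation in the concept space: (Ck^T Ck)^{-1} Ck^T x."""
    return np.linalg.solve(Ck.T @ Ck, Ck.T @ x)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank(Ck, Z, q):
    """Rank documents (columns of Z, already in the concept space)
    against the query vector q by cosine similarity."""
    q_hat = project(Ck, q)
    sims = [cosine(q_hat, Z[:, j]) for j in range(Z.shape[1])]
    return sorted(range(Z.shape[1]), key=lambda j: sims[j], reverse=True)

# toy example: 16 terms, 2 concepts, 5 documents already projected into Z
rng = np.random.default_rng(3)
Ck, Z = rng.random((16, 2)), rng.random((2, 5))
q = np.zeros(16); q[[2, 7]] = 1.0   # hypothetical query containing terms 2 and 7
print(rank(Ck, Z, q))               # document indices, best match first
```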
Shaded areas on Figure 1 represent the areas of relevant documents for queries in the cosine similarity sense (cosine of the angle between points in shaded areas and representation of the queries is greater than 0.9). The added documents are shown on the figure in the shape that correspond to the category they belong, but in lighter colour then documents of the starting collection. By usage of the Method A document D16 (DM document) is Figure 1: Representations of starting and added documents in the concept space. Representations of added document are shown in the shape that correspond to category of document, but in lighter colour then representations of starting documents. Shaded areas are areas of relevant documents for queries. mapped in the group of DM documents and D17 document (LA document) is mapped near group of LA documents. Document D19 which combines fields of linear algebra and data mining is mapped near LA documents (because it is represented by index terms similarly as document D6) and document D18 which contains index term data also contained in the query Q2 is mapped in the area of relevant documents for Q2 query. 5 Experiment Experiments are conducted on MEDLINE collection of documents. The collection contains 1033 documents (abstracts of medical scientific papers) and 35 queries. The documents of collection are split randomly into two parts: starting documents and added documents. The ratio of starting and added documents is varied: first added documents form 10% of the whole collection, then 20% of the whole collection, and so on. Starting list of index terms is formed on the basis of starting collection of documents. In the list are included all words contained in at least two documents of starting collection, which are not on the list of stop words. Further, the list of index terms is formed for the whole collection of documents in an analogous way. The obtained list of index terms for the whole collection contains 5940 index terms. We have used measure of mean average precision (MAP) [1] for evaluation of the experimental results. Concept decomposition is conducted under starting collection of documents and added documents are represented in the concept space by using one of the described methods for approximate addition of documents. After that, an evaluation of information retrieval performance is conducted under the whole collection of documents. Dimension of the space of reduced dimension is fixed to k=75. In the first row of Table 2, there is MAP of information retrieval in the case that procedure of concept decomposition is conducted under whole collection of documents (percentage of added documents is 0%). This value presents MAP in the case of recomputation of concept decomposition when new documents are added in the collection. All other values of MAP in the cases when the collection is divided into collection of starting and added documents in the different ratios, could be compared to this value. The second column of Table 2 presents number of added documents, while the third column presents number of added terms. Let us note that number of added terms grows linearly, and that the collection with only 20% of starting documents is indexed with a much smaller set of index terms then the whole collection. The fourth row presents MAP for approximate addition of documents by Method A . 
Percentage of added documents | Number of added documents | Number of added terms | MAP Method A | MAP Method B (a=1.0) | MAP Method B (a=1.5) | MAP Method B (a=2.0)
 0 |   0 |    0 | 54.99 | 54.99 | 54.99 | 54.99
10 | 104 |  456 | 51.98 | 52.20 | 52.33 | 52.37
20 | 208 |  753 | 54.96 | 55.10 | 55.09 | 55.23
30 | 311 | 1264 | 51.90 | 51.78 | 51.97 | 52.03
40 | 414 | 1673 | 50.84 | 50.60 | 51.09 | 51.64
50 | 517 | 2089 | 48.64 | 47.99 | 48.29 | 48.64
60 | 620 | 2696 | 44.26 | 44.08 | 45.04 | 45.49
70 | 723 | 3282 | 43.59 | 41.86 | 42.32 | 42.70
80 | 826 | 4024 | 39.87 | 40.56 | 42.56 | 43.74

Table 2: Mean average precision of information retrieval for approximate addition of new documents by Method A (without addition of new index terms) and Method B (with addition of new index terms), compared for different splits of the document collection. Parameter a (used in Method B) has the role of additionally stressing the importance of added terms and documents. The best results for every split of the document collection are shown in bold. Generally, the best results are achieved by Method B with a=2.0, but these results are not significantly better than the results obtained by Method A.

The remaining columns of Table 2 present the MAP of information retrieval for approximate addition of new documents by Method B for different values of the parameter a. The best results for every split of documents are shown in bold. From the results we can conclude that the addition of new index terms does not improve the MAP results significantly. Namely, the results obtained by Method B are better than the results achieved by Method A, and additional stressing of the added terms and documents (for a>1) has a positive effect on the results. Nevertheless, the results obtained by Method B with a=2.0 are not significantly better than the results obtained by Method A according to a paired t-test (α=0.05).

6 Conclusions and future work

Values of MAP for the approximate methods are acceptable in comparison to the repeated computation of the concept decomposition when the number of added documents is the same as or smaller than the number of starting documents. There is a drop of MAP when the number of added documents exceeds the number of starting documents. The results of MAP are not significantly improved by the methods that use the extended list of index terms obtained as a result of the addition of documents. It is interesting to notice that this statement is valid even in the cases when the list of index terms is significantly enlarged, which happens when a larger proportion of documents is added. These results show the great redundancy present in textual documents. In the future we plan to develop new methods of approximate addition of documents that will correct the existing concept vectors by using the representations of added documents.

References

[1] R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval, Addison-Wesley, ACM Press, New York, 1999.
[2] M. W. Berry, Z. Drmač, E. R. Jessup. Matrices, Vector Spaces, and Information Retrieval, SIAM Review, Vol. 41, No. 2, 1999, pp. 335-362.
[3] M. W. Berry, S. T. Dumais, G. W. O'Brien. Using linear algebra for intelligent information retrieval, SIAM Review, Vol. 37, 1995, pp. 573-595.
[4] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman. Indexing by latent semantic analysis, J. American Society for Information Science, Vol. 41, 1990, pp. 391-407.
[5] J. Dobša, B. Dalbelo-Bašić. Concept decomposition by fuzzy k-means algorithm, Proceedings of the IEEE/WIC International Conference on Web Intelligence, WI2003, 2003, pp. 684-688.
[6] J. Dobša, B.
Dalbelo-Bašić. Comparison of information retrieval techniques: latent semantic indexing and concept indexing, Journal of Information and Organizational Sciences, Vol. 28, No. 1-2, 2004, pp. 1-17.
[7] I. S. Dhillon, D. S. Modha. Concept Decomposition for Large Sparse Text Data using Clustering, Machine Learning, Vol. 42, No. 1, 2001, pp. 143-175.
[8] T. Kolda, D. O'Leary. A semi-discrete matrix decomposition for latent semantic indexing in information retrieval, ACM Trans. Inform. Systems, Vol. 16, 1998, pp. 322-346.
[9] G. W. O'Brien. Information Management Tools for Updating an SVD-Encoded Indexing Scheme, Master's thesis, The University of Tennessee, Knoxville, 1994.
[10] J. Yen, R. Langari. Fuzzy Logic: Intelligence, Control and Information, Prentice Hall, New Jersey, 1999.

A General Brokering Architecture Layer and its Application to Video on-Demand over the Internet

Franco Cicirelli and Libero Nigro
Laboratorio di Ingegneria del Software, Dipartimento di Elettronica Informatica e Sistemistica, Università della Calabria, I-87036 Rende (CS) - Italy
E-mail: f.cicirelli@deis.unical.it, l.nigro@unical.it

Keywords: service oriented computing, application framework, middleware, peer-to-peer, Internet, video on-demand, Java, Jini, Java Media Framework, RTP/RTCP protocols

Received: February 7, 2006

GOAL - General brOkering Architecture Layer - is a service architecture which allows the development and the management of highly flexible, scalable and self-configurable distributed and service-oriented applications over the Internet. GOAL centres on a design pattern which decouples the design of service functionalities from the distribution concerns. A service wrapper is specifically responsible for the distribution aspects. The wrapper is weaved at runtime to its corresponding service by a dynamic proxy object. GOAL makes it possible to augment, in a transparent way, the behaviour of a software object in order to permit it to be remotely accessible. The current implementation of GOAL depends on Sun Microsystems' Jini as the brokering/middleware layer. The paper describes GOAL and demonstrates its practical use through the design and implementation of a Video on-Demand system. Java Media Framework is used for pumping multimedia data at the transmitter side and for rendering purposes at the receiver side; RTP/RTCP protocols are used for multimedia streaming.

Povzetek: Predstavljen je GOAL - arhitektura za napredne spletne aplikacije, npr. video na zahtevo.

1 Introduction

Service Oriented Computing [1] emerged in the last decade as a computing paradigm centred on the concept of service [2, 3] as the basic building block. Services are suitable for developing and organising applications for large-scale open environments. They are effective in improving software productivity and quality, as well as fostering system evolution and maintenance. A service is a coarse-grained software component virtualizing a hardware or software resource which is made exploitable for on-demand use. Binding to the service/resource typically occurs just at the time the component is needed. After usage the binding may be discarded. Applications, tailored according to user requirements, are constructed through a combination and composition [4] of independent, outsourced service components exposing a well-defined interface. Main features of the service paradigm include dynamism and transparency. The former refers to services which can appear or disappear in a community without centralized control, possibly notifying about their presence/absence. This behaviour depends on the use of the so called discovery protocols [5, 6, 7]. The latter feature means that services can be used without knowledge about service provider platforms and service provider locations. Service dynamism and the interaction model strongly relate service architectures to peer-to-peer architectures [8]. Service computing relies on the high-level abstraction entities defined by Service Oriented Architecture (SOA) [9, 10] in order to (i) characterize and organize service-based applications and (ii) capture the relationships existing among these entities. Basic entities in a SOA are the service provider, the service client, and the service registry which acts as a broker among clients and providers. Each service, offered by a provider, preliminarily requires to be advertised in a registry in order for it to become subsequently discoverable and utilizable by a client (e.g. a human user or another service).

GOAL, the General brOkering Architecture Layer proposed in this paper, is a novel service architecture allowing the development of highly flexible, dynamic and self-configurable service-based applications. GOAL aims at simplifying the management of the service lifecycle by reducing the burden of designing, developing and deploying software objects suitable to work in a distributed context. Different distribution aspects like data consistency, fault tolerance, security and remote communications are treated as cross-cutting aspects. In particular, the latter two concerns are directly addressed by the system and are the responsibility of proxy objects. GOAL offers a minimal framework [11], easy to understand and use, and a few meta-services. A service design pattern, enforcing common guidelines for service development, is provided. Designing a new service does not introduce dependencies on a particular API or system components. Weaving of customer-objects and system-objects occurs only during service operation. Meta-services are system entities which allow one to publish, search and use customized services. A service, once advertised, becomes available within a GOAL community. Matching criteria can be specified during a searching phase. In addition, when a new service appears in or leaves the community, interested clients can be automatically notified. The notification mechanism ensures system flexibility and scalability and permits the development of highly dynamic and self-adapting software. A service can leave the community due to an explicit removing operation or due to a crash. In the latter case, self-healing properties are carried out using a fail-silent model [12] based on a leasing mechanism. The current implementation of GOAL depends on Jini [7, 13, 14] as the underlying service infrastructure, borrowing the advantages of dynamic registration, service lookup, notification of remote events, distributed object access and platform-independence enabled by Java. However, the brokering layer, i.e. Jini, is fully abstracted and can possibly be replaced. Communication among services is based on the exchange of Java objects and fully exploits the benefits of runtime code mobility [15]. GOAL can be used as the starting point for building further abstraction layers targeted to specific application domains. As a significant example, a Management Architecture for Distributed meAsureMent Services - MADAMS - [16] was developed directly on top of GOAL mechanisms. MADAMS is tuned to the requirements of distributed measurement systems [17, 18, 19].
MADAMS rests on the concept of measurement service as the basic abstraction entity modelling a (physical or virtual) measurement instrument, and the concept of connector which provides interinstrument communications. MADAMS also supports recursive service composition. MADAMS was successfully employed for demand monitoring and control [16] and for remote calibration of distributed sensors [20]. This paper describes GOAL features and the general design guidelines for achieving distributed services. As an application, GOAL capabilities are demonstrated through the development of a distributed Video on-Demand (VoD) system [21, 22, 23]. The VoD system depends on Java Media Framework (JMF) [24] which is responsible both for pumping multimedia data into a network connection at a sender side and for presenting multimedia data at a receiver side. Data streaming depends on RTP/RTCP protocols [25]. The structure of the paper is the following. Section 2 summarizes some related work. Section 3 presents the proposed service architecture along with its programming model. In particular, a description about the service design pattern, system security and the catalogue of meta-services is provided. Section 4 illustrates design and implementation and service catalogue concerning the prototyped VoD system. Finally, an indication of directions which deserve further work is furnished in the conclusions. 2 Related Work As with other technologies suited to the development of distributed systems, ensuring transparency and hiding management of distribution concerns allow developers to focus only on domain-specific problems. For these purposes, different infrastructures and middleware layers have been proposed centred on the service metaphor. Sun Microsystems' Jini [13, 7, 26] permits the construction of service-based applications in terms of fundamental mechanisms of service publication/discovery, leasing management, remote event notification and transaction support. In [27] an approach based on tuple-space [28] for building service frameworks is proposed. Concepts like actor, which execute client requests, virtual resource and virtual service are introduced. Virtual entities enable abstraction layers to be achieved on top of either physical resources or services thus ensuring a common and uniform way for accessing them. Spaces are used to manage (e.g. create, destroy, search) agents, services and resources. Other solutions are targeted to abstracting and hiding details of the adopted middleware/brokering layer in order to favour its interchangeability. By providing a well-defined set of components (i.e. interfaces and objects) and through code generation mechanisms, Iceni [29, 30] allows an automatic service management in different computing contexts such as Open Grid Service Infrastructure or Jini service community. In [31] a framework is proposed which hides behaviour of underlying transport layer and separates coordination patterns, i.e. request/response interactions, from computational logic, i.e. service functionalities. Colombo platform [32] introduces the concept of ser-vicelet as the unit of development and deployment. A ser-vicelet is a stateless object corresponding to a single service or to a collection of them. Context information are managed by specific Servicelet Context entities which are handled by the runtime system. 
Management of explicit metadata in the form of machine-readable service descriptions, including functional and non-functional QoS characteristics, is an important characteristic of Colombo. The goal is to avoid generating a gap between the internal representation of service capabilities and the external, interoperable service view which is defined by the service contract. Sirena framework [33] defines an architecture to seamlessly connect heterogeneous (resource constrained) devices and services furnished by such devices. Sirena comprises an incoherent set of tools having the responsibility of generating service stubs and skeletons, managing service lifecycle, supporting visual composition for service orchestration and so forth. A different goal is pursued in Arcademis [34] which is a Java-based framework enabling the implementation of modular and highly customizable middleware architectures for specific application domains. A distributed system built on top of Arcademis is structured according to three abstraction levels. The first level is constituted by basic components like invokers, which are responsible for emitting remote calls, or schedulers which are used to possibly order remote calls. Although these are abstract classes and interfaces, Arcademis also provides concrete components that can be used without further extensions. The second level is represented by the concrete middleware platform obtained from Arcademis basic components. The framework defers to this level decisions about serialization strategy, communication and lookup protocols that will be adopted. Finally, the third abstraction level is made up by components which make services available to end users. In the context of the above mentioned proposals, the original contribution of GOAL is twofold: (i) to allow development of new services without introducing, at design time, bindings to specific framework components (e.g. abstract classes or interfaces), (ii) to transparently handle distribution concerns as cross-cutting aspects. All of this fosters low coupling among entities, system evolution and maintenance in a natural way. 3 GOAL Service Architecture GOAL addresses all the activities involved in the lifecycle of services by exploiting a specific service design pattern and by using a set of well-defined system components and interfaces having specific roles. A main concern rests in encapsulating and hiding implementation details of core components by using stable interfaces so that if changes occur, e.g. in the middleware layer which is replaced or in the communication protocol stack, no consequence is induced in the implemented and working applications. GOAL components and features are discussed in the following. 3.1 Service Design Pattern The development of a generic service follows the service design pattern depicted in Fig. 1. Each remote service, i.e. 
a new software object made available within a GOAL community, is first developed as a local object. This permits design efforts to concentrate only on the effective service functionalities.

Figure 1: Components of the service design pattern (the LocalService interface, the ServiceWrapper abstract class with its bootstrap/shutdown and service-info operations, the ServicePublisher and ServiceFinder meta-services, the WrapperFactory, and the GOALProxy based on java.lang.reflect.InvocationHandler; wrapper and local service are related to the proxy only at runtime).

In this design phase, the only constraint to fulfil is in defining the functional aspects of the new service by means of an interface. One such interface is shown in Fig. 1 as the LocalService interface. Interfaces allow a what-how decomposition [35] which ensures service client code immutability with respect to service implementation changes. Any object can be a candidate for a remote service because no restrictions are introduced in the definition of its functional behaviour except for the serializability of the parameters appearing in method signatures. Remote concerns are managed by means of a service wrapper. This kind of object extends local service behaviour with distribution aspects like transaction support and so forth. As a common guideline, the wrapper may enfold the local service and execute distributed tasks by interleaving them with method calls on the wrapped service. A service wrapper can require to be bootstrapped, for instance by initializing its internal state with information retrieved by contacting other services. A shutdown operation allows a wrapper to tear down, i.e. to become out of work or unpublished. All of this is reflected in Fig. 1 by the ServiceWrapper abstract class. Other common functionalities allow: (a) setting the service identifier, (b) setting service info (e.g. a description of service behaviour and functionalities) and (c) managing an estimated load factor value of the service provider. By default, the above concerns are addressed by the system. When no special requirements have to be met, a DefaultWrapper can be transparently used. Would new functionalities be added to the local service, e.g. in order to cope with data consistency and integrity among multiple system nodes, an extension of the default wrapper may be supplied. At compile time, the local service interface and the relevant wrapper may be completely unrelated. A wrapper has to override only the local service operations whose capabilities require to be extended. Only at runtime, the wrapper and the local service behaviour will be weaved according to an aspect-oriented programming style [36]. Two problems arise when making a local service remotely accessible: (i) the service has to be advertised into a community, (ii) a representative object for the service, i.e. a GOAL proxy, is required to be downloaded on the service client in order to support remote communications. Service advertisement is the responsibility of the ServicePublisher meta-service (see Fig. 1).
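Since the weaving just described relies on Java dynamic proxies, a minimal self-contained sketch of the idea may help; the type names and the preference rule below are illustrative assumptions made for this sketch, not the actual GOAL implementation.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Minimal illustration of the runtime weaving idea: the handler forwards a call
// to the wrapper when the wrapper declares a method with the same signature,
// otherwise to the local service. Types and names are hypothetical stand-ins.
public final class WeavingHandler implements InvocationHandler {

    private final Object localService;   // the plain local object implementing the service interface
    private final Object wrapper;        // the (unrelated) wrapper handling distribution concerns

    public WeavingHandler(Object localService, Object wrapper) {
        this.localService = localService;
        this.wrapper = wrapper;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        try {
            // Prefer the wrapper's method with the same signature, if it exists.
            Method overridden = wrapper.getClass()
                    .getMethod(method.getName(), method.getParameterTypes());
            return overridden.invoke(wrapper, args);
        } catch (NoSuchMethodException notOverridden) {
            return method.invoke(localService, args);
        }
    }

    /** Creates a proxy exposing the given service interface, woven with the wrapper. */
    @SuppressWarnings("unchecked")
    public static <T> T weave(Class<T> serviceInterface, T localService, Object wrapper) {
        return (T) Proxy.newProxyInstance(
                serviceInterface.getClassLoader(),
                new Class<?>[] { serviceInterface },
                new WeavingHandler(localService, wrapper));
    }
}

A client holding the object returned by weave(SomeServiceInterface.class, localService, wrapper) calls it exactly as it would call the local object, while overridden operations are transparently routed through the wrapper.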
Service finding and proxy download activities are in charge of the ServiceFinder meta-service (see Fig. 1). The proposed publisher/finder mechanisms help in hiding details about the actual brokering layer. Would the brokering layer be replaced, e.g. CORBA used instead of Jini, only the publisher/finder objects have to be correspondingly modified. While publishing a local service, behind the scenes the service publisher (i) asks a WrapperFactory for a wrapper instance, (ii) correlates service and wrapper with the proxy and (iii) makes the latter object available to a GOAL community using the functionalities of the actual brokering layer. The GOALProxy (see Fig. 1) is a remotely accessible object which, moving to a service client host, transparently acts as a dispatcher of request/response messages between the remote user and the local service provider. In case the communication protocol changes, e.g. XML-RPC is preferred to Java RMI, only the proxy requires to be changed. By preferring the execution of overridden methods, the proxy realizes the interweaving among the local service and the corresponding wrapper. At runtime, by exploiting Java dynamic proxy mechanisms [37], a GOALProxy is able to implement a list of specified interfaces without requiring code generation. Implementing a specific interface of a local service ensures that a generic client is not able to perceive any difference between direct service usage and proxy-mediated usage. The sequence diagram in Fig. 2 summarizes the effects of the service design pattern on service publication and utilization. Figure 3, instead, depicts communication details characterizing the interactions among a service client and the associated service provider.

Figure 2: Sequence diagram capturing service utilization.

Figure 3: GOAL service usage scenario: communication details.

A proxy may enforce the download of the entire service code on a user node. This is useful when the service must be executed on the client side, e.g. for accessing hardware or software resources hosted on a particular computing node [20]. Proxy behaviour is defined during publication simply by setting some of the so called GOALServiceProperty(s). If no constraints appear on the object serializability or persistence, a service may be used according to remote or downloadable mode. A GOALProxy may be specialized in order to augment the capabilities of the overall system. For instance, to deal with fault-tolerance concerns, a proxy can be designed to abstract communications between a single client and a group of equivalent service providers, so that if one of the providers becomes unavailable or crashes, the client remains able, in a transparent way, to continue its work [38]. The class diagram of the service design pattern (see Fig. 1) also makes clear that, once published, a local service becomes a GOALService. The GOALService interface defines a set of functionalities which are common to all services in a GOAL system. In particular, the setExceptionListener method is used to set a listener whose aim is to handle exceptions not thrown by local service methods but raised during remote operation. The setUser method is used to set a UserACK object especially devoted to coping with security concerns (see section 3.2). The remaining operations should be self explanatory. Other common issues of the service design pattern are user friendliness and load balancing support. Each service may possess a graphical user interface (GUI) obtained through a GUIFactory object previously published by using the service publisher (see also Fig. 6). The service finder is then used by clients in order to search and retrieve the factory. Advantages of this approach are: (i) a service is developed independently from its graphical interface, (ii) the GUI is instantiated only on the client side thus avoiding serialization of graphical objects, (iii) the GUI allows use of the service without any previous knowledge about it, (iv) multiple graphical user interfaces, possibly tied to different user node capabilities, can be supported. Load balancing is carried out by using the so-called remote service attributes. Every service has one such attribute that expresses its load factor, i.e. NORMAL for a low or normal load factor, and WARNING or BUSY for a high/very high load factor. Although services are usually supposed to remain context-free, remote attributes can provide a kind of context information [39, 40] exploitable during the finding phase. For instance, the service finder always tries to return services in the NORMAL state, if there are any, otherwise the first one matching the searching criteria is returned. The service publisher keeps the remote attributes up to date by periodically querying the service wrappers' state or (possibly) by monitoring the CPU usage on provider nodes.

3.2 Security Concerns

Handling security is an essential issue in a distributed and multi-user scenario. The service code downloaded from a remote site requires to be trusted along with the remote site itself. User credentials must be verified, usage of system resources granted and resource accesses controlled. Authentication and authorization mechanisms of GOAL are implemented through the UserACK object (see Fig. 1) which permits user identification and acknowledgment of the user's roles (e.g. administrator or normal user), privileges and grants. The concept of user groups is introduced and users may become members of one or multiple groups. Each group, e.g. the admin group, owns some GOAL permissions and the UserACK holds the union of all permissions relevant to the groups the user has joined. Information stored in a permission follows the same hierarchical schema adopted in Java package specifications. Creating a permission with a service package info and service name enables access to the corresponding service. By providing only package information, a grant is given to all services belonging to the package. A finer authorization control is achieved by specifying service method/function name(s) in the permission. The use of UserACK makes it possible, in a decentralized context, to accept or discard a user request. User grants are checked directly by GOAL proxies.
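The package-like structure of the permissions just described can be illustrated with a tiny, self-contained sketch; the class below is purely hypothetical and only mirrors the prefix semantics explained above (a grant on a package implies grants on everything beneath it), it is not GOAL code.

import java.util.Set;

// Illustrative only: a minimal model of hierarchical, package-style permissions
// in the spirit of UserACK grants. Names and the dot-separated paths are invented.
public final class GoalStylePermission {

    private final String path;   // e.g. "vod.streaming" or "vod.streaming.VideoFileManager.deleteVideo"

    public GoalStylePermission(String path) {
        this.path = path;
    }

    /** A permission implies a request when it names the request itself or one of its ancestors. */
    public boolean implies(String requested) {
        return requested.equals(path) || requested.startsWith(path + ".");
    }

    /** A request is granted when any permission held by the user's groups implies it. */
    public static boolean granted(Set<GoalStylePermission> userPermissions, String requested) {
        return userPermissions.stream().anyMatch(p -> p.implies(requested));
    }
}

Read this way, a permission built from only the package information grants access to every service in that package, while adding the service and method name restricts the grant to a single operation; since the check happens inside the proxy, the service code itself never sees these objects.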
Therefore, authentication and authorization concerns are transparently managed with respect to service functionalities and service implementation. During publication, a specific GOALServiceProperty can be used to state if the proxy has to enable or disable the management of security concerns, i.e. to state if the service has to be considered secure or public. In the case security aspects are to be explicitly managed by a service, the UserACK object must be transmitted as a parameter when invoking its methods. Users can freely propose new groups and create their own UserACK. However, only signed UserACKs and accepted groups can be effectively used. A system administrator signs a new UserACK and establishes its expiration time. Signed UserACKs cannot be modified by users: the system is able to recognize when a UserACK is corrupted, modified or just invalid (e.g. it expired). UserACK and group management is responsibility of the core downloadable Grant Management service whose GUI is shown in Fig. 4. A UserACK submission form is offered and group membership is achieved by choosing items from a list of already accepted groups. Inspection of group properties is allowed. Likewise to UserACK, a group submission form is also available. Submitting a new group requires the group name and the list of group permissions to be provided. A reserved area, offering an overall vision of existing UserACKs and groups, is under the control of the system administrators (see Fig. 5). New User-ACKs/groups can be signed/accepted and single permissions can be added/removed in a group as well as in a submitted UserACK. The accesses to a service can be also allowed or denied depending on other criteria. Load balancing aspects, service availability or service exclusive use may often be considered during the UserACK acquisition phase. Confidentiality and privacy can be ensured by using the Secure Socket Layer for service communications, whereas trust-ness can be achieved by exploiting Java standard security mechanisms relevant to remote code management [41]. Figure 4: Grant Management service GUI. 3.3 Meta-Service Catalogue GOAL meta-services are responsible for publishing, searching, retrieving or removing services from a community. Figure 6 portrays the UML class diagram of the publisher/finder services which depend only on interfaces. Actual objects are created by a singleton factory which ensures a coherent delivering according to the underlying middleware layer. To cope with security concerns, a valid UserACK is required when using meta-services. Only the method find(String):GOALService can be used without a UserACK. This provides a bootstrap mechanism exploitable by new users for contacting the service devoted to grant management in order to obtain the personal User-ACK. The advertisement process is carried out by requiring the service to publish and a list of GOALServiceProp-erty. These properties allow to specify descriptive and behavioural attributes. Descriptive attributes may be used, for instance, to provide service description. Behavioural attributes must be used to specify the name of the interface through which the service will be retrieved by clients, the wrapper to use and so forth. Following a successful invocation, the publish method returns the unique service identifier. A service may be published using different interfaces thus allowing multiple views of the same local object. 
Among service properties it is also possible to specify a service working directory which will be used, by the system, for storing persistent information like the service identifier. Service properties may also be labelled as searchable. In this case, properties may be specified as matching attributes during the finding phase. The publishServiceUIFactory method is used to publish the UIFactory of a specified service; unpublish is used instead to remove a service from the community. Finding a service requires the service name, i.e. its interface name, and (possibly) service identifier information.

Figure 5: Administration panel of the Grant Management service.

Figure 6: Publisher/finder service design. The ServicePublisher interface offers publish(service, properties, user), publishServiceUIFactory(serviceID, factory, user) and unpublish(serviceID, user); the ServiceFinder interface offers find(serviceClassName), find(serviceClassName, serviceIDs, user), findAllServices(serviceClassName, listener, user) and findServiceUIFactory(service, user); instances of both are created by the CoreServiceSingletonFactory.

Although the finding process is based on naming criteria, the matching can also occur when a published service implements any interface in a hierarchy. As a side-benefit, the use of a textual name and the availability of a service GUI enable usage of any published service without requiring specific Java code to be installed on the client node. The findAllServices method allows service retrieval bypassing the load balancing policy. If specified, a ServiceListener (see Fig. 6) notifies when a new searched service joins or leaves the community. Meta-services require their code to be pre-installed on every GOAL node.

4 A GOAL-based VoD System

The following describes a VoD system developed on top of GOAL. The VoD system consists of a service federation which permits publishing, searching and retrieving as well as streaming and rendering of multimedia contents. Java Media Framework [24] is used for pumping (at the provider side) and rendering (at the client side) multimedia data. Streaming of multimedia contents relies on the RTP/RTCP protocols [42]. First the service architecture is described, then the list of developed services for the VoD system is provided.

4.1 System Architecture

The architecture of the achieved VoD is depicted in Fig. 7. It consists of five types of computing nodes having different roles in supporting VoD services. Nodes, and relevant services, can dynamically join or leave the system and when this occurs the other nodes are notified.

Figure 7: Architecture of the GOAL-based VoD system (streaming, browsing, rendering, searching and theatre nodes connected through an intranet/Internet).

Some nodes may be duplicated: multimedia files and related descriptions are normally distributed across multiple streaming nodes. Other kinds of nodes, instead, may be duplicated for fault-tolerance and load balancing issues. VoD services are ultimately requested and made available to final users through user nodes (see Fig. 7). The architecture was designed so as to minimize code requirements on the user nodes. Here only some standard code, like JMF and obviously the GOAL meta-services code, is supposed to be statically available.
All of this contributes to system evolution because, by exploiting the download of the service code, a user, on-demand, will always use the latest version of the various software components. A description of each node is provided in the following. Streaming nodes are media servers. They contain multimedia data files and associated descriptions (e.g. title, authors etc.). Streaming nodes enable to: (i) access the descriptions of media data; (ii) add/remove multimedia files; (iii) create and manage multimedia sessions (for data streaming and control) on behalf of end users. Browsing nodes respond to management functionalities. Relevant services offer a unified list of the multimedia content available on the various streaming nodes and allow users to organize multimedia data across them. The organization consists in adding/removing/modifying multimedia contents on different streaming nodes. create and an instance of Searching nodes portray a whole vision of all the existing multimedia contents by providing: (i) the unified list of available media files distributed across streaming nodes; (ii) searching facilities, e.g. for selecting specific movies; (iii) user profiles in order to tailor media information on a per user basis or to send notifications when relevant new media data come into existence; (iv) trace facilities about media utilizations like reviews, user preferences and so forth. Rendering nodes act as remote libraries from which user nodes can dynamically download the code required for rendering a video content, possibly by ensuring also receiver based QoS control, e.g. lip-sync [43]. Theatre nodes provide a service which is used as entry point for user interactions. In order to view a movie a user has to (i) searching it by using searching node functionalities, (ii) starting and managing multimedia sessions by using streaming node services, (iii) managing the rendering process on the user node by retrieving and using rendering libraries downloaded from a rendering node. All of this requires the utilization and the coordination of multiple VoD services which in turn are provided by different computing nodes. By using the service exported by a theatre node, a user obtains an holistic vision of the entire VoD system. In this way, issues concerning single service invocation and coordination are fully abstracted. 4.2 Service Catalogue SessionController and VideoFileManager Are specific of streaming nodes. SessionController negotiates and creates a multimedia session between a client node and a streaming server node, with distinct control and streaming bindings. The streaming binding is used for media data streaming, e.g. unicast, and relies on the RTP/RTCP protocols [25]. A negotiation phase is required for establishing port identifiers at both receiver and transmitter side. The control binding is TCP-based and is used for exchanging session control commands (e.g. play, rewind, pause and stop). SessionController does not require its functionalities to be extended for remote access. Therefore, the DefaultWrapper can be transparently used during the publication phase. SessionController service has a VCR-like GUI which is automatically made available at the end of the negotiation phase. The VideoFileManager service is mainly devoted to adding/removing media files to/from a specific streaming node and managing media file information, e.g. title, director and language. Information about duration, file encoding and so forth are automatically detected and made available by the service. 
Media information is stored in XML format. VideoFileManager also notifies a set of listeners when, for instance, a new movie is added. A complete list of available movies is also provided. In order to enforce data consistency, listeners require to be notified under transactional support. Transaction management is the responsibility of the VideoFileManagerWrapper (see Fig. 8) and relies on the Jini transaction mechanism.

Figure 8: Class diagram of VideoFileManager and related entities (the VideoFileManager interface and its GUI, the VideoFileManagerWrapper overriding addVideoChangeListener and removeVideoChangeListener, and the runtime relations to ServiceWrapper, GOALService and GOALProxy).

The class diagram in Fig. 8 makes clear that the wrapper and the corresponding service are unrelated at compile time, i.e. they do not implement or extend any common entity. As discussed in section 3.1, the weaving between the two objects is only established at runtime by means of a GOALProxy. During the bootstrap phase (see section 3.1), the wrapper registers itself as a listener of the VideoFileManager. Subsequently, it will act as a dispatcher of notifications coming from the wrapped service and going toward remote listeners. By overriding the methods addVideoChangeListener and removeVideoChangeListener, the wrapper obtains to be the exclusive manager of the RemoteVideoChangeListener(s), which are handled under transaction. A RemoteVideoChangeListener is a listener whose functionalities are extended to support notification and management of remote events. In addition, the remote listener behaves as a transaction participant when notified by a transaction client [26], i.e. a VideoFileManagerWrapper. To enforce self-healing properties, the registration of a remote listener is regulated by a lease. As one can see, none of the methods reported in Fig. 8 makes use of UserACK objects; this is because security concerns are transparently handled by the GOALProxy. Figure 8 also shows the relationship existing between VideoFileManager and its GUI. Figure 9 depicts the GUI of a VideoFileManager service while the upload of a new movie occurs.

Figure 9: VideoFileManager GUI.

StreamSearcher It is specific of searching nodes and provides a searching facility allowing uniform access to all the media data available on existing streaming nodes. A StreamSearcher enriches media data with information about user activities and collects user reviews and profiles. It acts as a listener of events coming from a VideoFileManager, e.g. informing that a new movie has been added to the video library, or coming from other StreamSearcher(s), e.g. informing that another review was added. This is reflected in the class diagram reported in Fig. 10, where the StreamSearcher interface extends the VideoChangeListener interface. As one can see, some methods of StreamSearcher require a UserACK object as a parameter. Although security concerns are always managed by the proxy, one such object is required for tracing the user's activities.

Figure 10: Class diagram of StreamSearcher and related entities.

Likewise to the VideoFileManager service, a StreamSearcherWrapper (see Fig. 10) was introduced for guaranteeing consistency and integrity of the data exchanged with other services, i.e. VideoFileManager(s) and StreamSearcher(s). In this case, the wrapper extends the RemoteVideoChangeListener interface. In particular, the wrapper registers itself as a listener of the enfolded service and as a listener of the VideoFileManager(s) and StreamSearcher(s) working in the service community. The wrapper acts either as a transaction participant or as a transaction client. Only at transaction commit, data received from other nodes are transmitted to the enfolded service. At the bootstrap phase (see Figs. 1 and 2) the wrapper is in charge of initializing its own media data repository. If other searching nodes are available, a data mirroring is performed, otherwise it has to contact all the VideoFileManager(s) in order to retrieve info about movies. Media data and the so called enriched media data (i.e. title, authors, user reviews, and so forth) are represented in XML and stored in an XML DBMS such as eXist [44]. Figure 11 shows an interaction with the StreamSearcher service with a specification of searching criteria for finding a movie.

Figure 11: StreamSearcher GUI.

Renderer It is specific of rendering nodes and, in the context of a multimedia session, it assists the audio/video rendering process on a client node. The rendering process can possibly be accompanied by a receiver-based QoS control filter [45]. Within an allowed end-to-end delay, such a filter separately buffers incoming audio/video packets, assembles media frames and synchronizes media frame presentation on the basis of their presentation time (this is achieved by elaborating both the timestamps of RTP packets and the sender report packets of RTCP in order to periodically adjust the real-time clock of the receiver subsystem to the real-time clock of the sender). Too-late arriving or corrupted packets are discarded. The filter is capable of controlling intra-medium jitter and inter-media skew directly affecting the lip-synch problem [43]. The rendering service allows management of volume and video zoom factor as well as showing the reproduction time of the rendered movie. This service requires to be fully downloaded on the client node and no remote functionalities are added.

Browser Specific of browsing nodes, this service allows the listing of the VideoFileManager(s) available in the VoD system. This service is only for management purposes, i.e. selecting a streaming node to administrate. The Browser service requires to be fully downloaded on a client node and no remote functionality is added.

Theatre Specific of theatre nodes, this composed service provides added value to final user activities. Behind the scenes a theatre asks for a searching service and for a rendering service as well as, once a movie is chosen, for the right session controller service in order to start and manage the incoming multimedia session. A negotiation phase is required between the SessionController and the Renderer for agreeing on IP addresses and port numbers. Theatre is a downloadable service which does not require remote functionalities to be added, and the DefaultWrapper is used during its advertisement.
The theatre service does not have an own graphical interface: it supports user interaction through the GUI(s) of the component services (see Fig. 12). Figure 12: Theatre service vision. 5 Conclusions The General brOkering Architecture Layer facilitates the development of general-purpose distributed service-based applications. GOAL is based on a service design pattern and makes an application development independent with respect to a specific middleware technology. By keeping a clear separation between local vs. remote concerns and by exploiting the service metaphor GOAL fosters software evolution and maintenance. Development time and design efforts as well as the initial background required for making an application remotely usable are very small. Management, though, of more complex distribution concerns like transaction support, requires a deeper knowledge about GOAL components and the underlying middleware layer. GOAL mechanisms have been successfully experimented in the realization of significant applications like distributed measurement systems [16, 20]. This paper reports about the achievement of a Video on-Demand system over the Internet. Current implementation of GOAL depends on Java/Jini technology. Directions of further work include the following: - specializing service proxies with the purpose of allowing interoperability between GOAL services and Web Services [46] - introducing design by contract [47] by foreseeing pre-conditions and post-condition to be transparently managed via service proxy when calling a service method - making it available functionalities for supporting long term transaction [48] by offering a coordinator core-service accepting a list of methods to be managed under transaction - adding management of non functional aspects [49], such as service availability, service response time and throughput, either in the service advertisement or within the finding process - extending the VoD system in order to support multicast and cooperative multimedia sessions [23]. References [1] M.P. Papazoglou and D. Georgakopoulos. Service oriented computing. Communications of the ACM, 46(10):24-28, 2003. [2] K. Bennett, P. Layzell, D. Budgen, P. Brereton, L. Macaulay, and M. Munro. Service-based software: the future for flexible software. In Proceedings of the Seventh Asia-Pacific Software Engineering Conference (APSEC'00), pages 214-221, Washington, DC, USA, 2000. IEEE Computer Society. [3] R. Perrey and M. Lycett. Service-oriented architecture. In Proceedings of the Symposium on Applications and the Internet Workshops (SAINT'03 Workshops), pages 116-119. IEEE Computer Society, 2003. [4] J. Yang and M. P. Papazoglou. Service components for managing the life-cycle of service compositions. Information Systems, 29(2):97-125, 2004. [5] E. Guttman, C. Perkins, J. Veizades, and M. Day. Service location protocol, version 2, rfc 2608. http:// www.ietf.org/rfc/rfc2608.txt. Accessed on October 2005. [6] UPnP. Universal plug and play device architecture. http://www.upnp.org/download/ UPnPDA10\_20000613.htm. Accessed on October 2005. [7] W.K. Edwards and W. Edwards. Core Jini. NJ: Prentice Hall, second edition, 2001. [8] R. Schollmeier. A definition of peer-to-peer networking for the classification of peer-to-peer architectures and applications. In Proceedings of the First International Conference on Peer-to-Peer Computing (P2P '01), pages 101-102, Lingkoping, Sweden, August 2001. IEEE. [9] M.P. Papazoglou. Service-oriented computing: Concepts, characteristics and directions. 
In Proceedings of the Fourth International Conference on Web Information Systems Engineering, pages 3-12. IEEE Computer Society, December 2003. [10] M. Shaw and D. Garlan. Software architecture: perspective on an emerging discipline. Prentice-Hall, 1996. [11] M.E. Fayad and D.C. Schmidt. Object-oriented application framework. Communications of the ACM, 40(10):32-38, 1997. [12] D. Cotroneo, C. Di Flora, and S. Russo. Improving dependability of service oriented architectures for pervasive computing. In Proceeding of the Eighth International Workshop on Object-Oriented RealTime Dependable Systems (WORDS'03), pages 7481. IEEE Computer Society, 2003. [13] Sun Microsystems. Jini network technology -specifications (v2.1). http://www.sun.com/ software/jini/specs/index.xml. Accessed on May 2006. [14] Jini network technology. com/software/jini/. 2005. http://www.sun. Accessed on October [15] A. Carzaniga, G.P. Picco, and G. Vigna. Designing distributed applications with mobile code paradigms. In Proceedings of the 19th International Conference on Software Engineering, pages 22-32. ACM Press, 1997. [16] F. Cicirelli, D. Grimaldi, A. Furfaro, L. Nigro, and F. Pupo. MADAMS: a software architecture for the management of networked measurement services. Computer Standards & Interfaces, 28(4):396-411, 2006. [17] D. Grimaldi, L. Nigro, and F. Pupo. Java based distributed measurement systems. IEEE Transactions on Instrumentation and Measurement, 47(1):100-103, 1998. [18] A. Furfaro, D. Grimaldi, L. Nigro, and F. Pupo. A measurement laboratory over the internet based on Jini. In Proceedings of the Twelfth IMEKO TC4, pages 479-501, 2002. [19] W. Winiecki and M. Karkowski. A new Java-based software environment for distributed measuring systems design. IEEE Transactions on Instrumentation and Measurement, 51(6):1340-1346, 2002. [20] F. Cicirelli, A. Furfaro, D. Grimaldi, and L. Ni-gro. Remote sensor calibration through MADAMS services. In Proceedings of the IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS'05), Sofia, Bulgaria, 2005. [21] L.A. Rowe, D.A. Berger, and J.E. Baldeschwieler. The Berkeley Distributed Video on-Demand System. In T. Ishiguro, editor, Proceedings of the Sixth NEC Research Symposium, pages 55-74. SIAM, 1995. [22] K.C. Almeroth and M.H. Ammar. The Interactive Multimedia Jukebox (IMJ): a new paradigm for the on-demand delivery of audio/video. In Proceedings of the Seventh International World Wide Web Conference (WWW7), pages 431-441, 1998. [23] G. Fortino and L. Nigro. ViCRO: an interactive and cooperative videorecording on-demand system over Internet MBone. Informatica, 24(1):97-105, 2000. [24] Java Media Framework. http://java.sun. com/products/java-media/jmf/index. jsp. Accessed on November 2005. [25] C. Crowcroft, M. Handley, and I. Wakeman. Internetworking Multimedia. UCL Press, London, 1999. [26] R. Flenner. Jini and JavaSpaces Application Development. SAMS, first edition, 2001. [27] J. Jang. An approach to designing reusable service frameworks via virtual service machine. In Proceedings of the Symposium on Software reusability (SSR'01), pages 58-66. ACM Press, 2001. [28] N. Carriero and D. Gelernter. How to write parallel programs. MIT Press, 1990. [29] N. Furmento, W. Lee, A. Mayer, S. Newhouse, and J. Darlington. ICENI: an open grid service architecture implemented with Jini. In Proceedings of the ACM/IEEE Conference on Supercomputing, pages 110. IEEE Computer Society Press, 2002. [30] N. Furmento, J. 
Hau, W. Lee, S. Newhouse, and J. Darlington. Implementations of a Service-Oriented Architecture on top of Jini, JXTA and OGSI. In Proceedings of the Second Across Grids Conference, pages 90-99. Springer-Verlag, 2004. [31] L. Fuentes and J.M. Troya. Towards an open multimedia service framework. ACM Computing Surveys (CSUR), 32(1):24-29, 2000. [32] F. Curbera, M. J. Duftler, R. Khalaf, W. A. Nagy, N. Mukhi, and S. Weerawarana. Colombo: Lightweight middleware for service-oriented computing. IBM Systems Journal, 44(4):799-820, 2005. [33] H. Bohn, A. Bobek, and F. Golatowski. SIRENA - service infrastructure for real-time embedded networked devices: A service oriented framework for

Entropy-Driven Parameter Control for Evolutionary Algorithms

Shih-Hsi Liu
Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, AL 35294, USA
liush@cis.uab.edu, http://www.cis.uab.edu/liush

Marjan Mernik
Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia
marjan.mernik@uni-mb.si, http://lpm.uni-mb.si/mernik

Barrett R. Bryant
Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, AL 35294, USA
bryant@cis.uab.edu, http://www.cis.uab.edu/bryant

Keywords: entropy, evolutionary algorithms, exploration, exploitation, PPCea

Received: November 3, 2006

Every evolutionary algorithm needs to address two important facets: exploration and exploitation of a search space. Evolutionary search must combine exploration of the new regions of the space with exploitation of the potential solutions already identified. Balancing exploration with exploitation needs to be done intelligently. This paper introduces an entropy-driven parameter control approach for exploring and exploiting evolutionary algorithms. Entropy represents the amount of disorder of the population, where an increase in entropy represents an increase in diversity. Four kinds of entropy to express diversity and to control the entropy-driven approach are discussed. The experimental results on a unimodal function, a multimodal function with many local minima, and a multimodal function with only a few local minima show that the entropy-driven approach achieves a good and explicit balance between exploration and exploitation.

Povzetek: V članku je opisan adaptiven način krmiljenja raziskovanja in izkoriščanja v evolucijskih algoritmih, voden s pomočjo entropije.

1 Introduction

Evolutionary Algorithms (EAs) [2, 12] are a common term for solving problems with computers that use models and mechanisms from biological evolution. Such nature-inspired EAs simulate evolution and its mechanisms such as selection, crossover, and mutation. The most well known examples of EAs are Genetic Algorithms (GAs), Evolution Strategies (ESs), Evolutionary Programming (EP), and Genetic Programming (GP) [12]. They have been used successfully for planning, design, simulation and identification, controlling, classification, and for solving many other hard optimization problems. EAs are general purpose search methods with a good yet implicit balance between exploration and exploitation. Exploration is a process of visiting entirely new regions of a search space and seeing if anything promising may be found in the regions. Exploitation is a process of using information gathered from the previously visited points in the search space to determine which regions might be profitable to be visited next. Additionally, exploitation techniques are good at finding local optima. However, how is the balance between exploration and exploitation achieved in EAs? More importantly, how can the balance be controlled?

In EAs, the selection process, operators (e.g., crossover and mutation), and population size establish a balance between the exploration and exploitation of the search space [6]. The selection process drives the search towards the regions of the best individuals. Hence, exploitation is done by selection. However, Bäck [1] showed that the selection processes can control the level of exploration or exploitation by varying selection pressure. Higher selection pressure pushes the search towards more exploitation and lower selection pressure urges the search towards more exploration. A mutation operator randomly modifies individuals, with a given probability, and thus increases the structural diversity of the population. From this point of view, the mutation operator is more an exploration operator. Such an operator helps to recover the genetic diversity lost during the selection phase and to explore new solutions avoiding premature convergence. Conversely, mutation can also be seen as an exploitation operator, because most of the genetic material is preserved. However, note that in some EAs (e.g., evolution strategies) mutation has a much bigger exploration role than in genetic algorithms. The crossover operator combines two or more parents to generate better offspring. Such a combination can be derived from the idea that the exchange of information between good individuals will generate even better offspring. From this point of view, the crossover operator is more an exploitation operator. However, a good crossover operator should also generate individuals in the exploration zone. Directing the evolutionary process towards exploration or exploitation is also possible by population resizing [9]. With a bigger population size, the search space is more explored than with a smaller population size. Therefore, a good balance between exploration and exploitation in EAs is achieved by selection, good mutation and crossover operators, and by determining the parameters (e.g., p_m, p_c, tournament size, population size) which control mutation, crossover, and selection, respectively. There have been a variety of studies on determining the best control parameter values [4, 5]. The main problem is to find a set of control parameters which optimally balances exploration and exploitation: if crossover and mutation rates are very high, much of the space will be explored, but there is a high probability of losing good solutions and of failing to exploit existing schemata. If crossover and mutation rates are low, the search space is not explored. The population diversity then rapidly decreases, ending up in a premature convergence to a non-optimal solution. Despite that, many researchers believed that EAs are effective because of a good ratio between exploration and exploitation. In EAs, however, this ratio is implicitly controlled. In some other search techniques such as reinforcement learning [18], one has explicit control over exploration and exploitation. In EAs, one no longer has explicit and respective control over exploitation and exploration, because it is difficult to delimit exploration from exploitation. In this paper, an entropy-driven exploration and exploitation approach is presented. The exploration/exploitation of the search space is adapted on-line based on the current status of the evolutionary process.
The on-line adaptation mechanism involves a decision process as to whether more exploitation or exploration is needed, depending on the current progress of the algorithm and on the current estimated potential of discovering better solutions. This decision process is described in a metaprogramming fashion using a domain-specific language, PPCea (Programmable Parameter Control for Evolutionary Algorithms) [10]. Because of space considerations, the paper only presents the experimental results using genetic algorithms. Experimenting with the role of mutation in balancing exploration and exploitation in evolution strategies is left for future work.

The paper is organized as follows. Section 2 describes the related work. In Section 3, four kinds of entropy are introduced to control exploration and exploitation. Section 4 shows the experimental results on the benchmark functions. Finally, Section 5 presents the conclusion.

2 Related Work

The optimal balance between exploration and exploitation has mainly been controlled by determining the best control parameter values. There are a variety of studies on this topic [5, 8, 10]. Recommendations on control parameters for a particular set of problems can be found in [4, 15]. In [5], an overview of this problem is given, where the authors distinguish between parameter tuning and parameter control. Furthermore, methods for parameter control have been classified into deterministic, adaptive, and self-adaptive categories: the deterministic category adjusts parameters by deterministic rules; the adaptive category utilizes the feedback of the evolutionary process to control the direction and magnitude of the parameters; and the self-adaptive category encodes parameters into the individuals, where they undergo mutation and recombination. An example of balancing exploration and exploitation by parameter control is the following: as soon as an algorithm approaches the optimum, the mutation step size must be decreased to balance the probability of generating a new successful point. A simple idea is to decrease the mutation step size s by a deterministic schedule such as s_t = s_0 / t or s_t = β^t · s_0, where β ∈ (0, 1), as sketched below.
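To make the two deterministic schedules mentioned above concrete, here is a minimal Python sketch; the function names and the initial step size are illustrative assumptions and not code from the paper.

import math

def step_size_hyperbolic(s0, t):
    """s_t = s_0 / t for t >= 1: the step size shrinks with the generation count."""
    return s0 / t

def step_size_geometric(s0, t, beta=0.9):
    """s_t = beta**t * s_0 with 0 < beta < 1: exponential decay of the step size."""
    return (beta ** t) * s0

if __name__ == "__main__":
    s0 = 1.0  # assumed initial mutation step size
    for t in (1, 10, 100):
        print(t, step_size_hyperbolic(s0, t), round(step_size_geometric(s0, t), 6))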
One of the earliest researchers to investigate entropy in EAs was Rosca [14], whose experiments showed that populations appeared to be stuck in local optima when entropy did not change or decreased monotonically over successive generations. Rosca used the fitness values in a population to define entropy and a free energy measure. Our work extends Rosca's in trying to find different ways to compute entropy in EAs. Moreover, using entropy as a diversity measure and metaprogramming parameter control by PPCea [10], we are able to control exploration and exploitation in an adaptable manner. The Diversity-Guided Evolutionary Algorithm (DGEA) [17] uses a distance-to-average-point measure to alternate between phases of exploration and exploitation. It can be expressed easily as a PPCea program; however, DGEA does not use entropy as a measure of diversity. In [11], entropy is introduced into EAs for determining the optimal number of clusters, but in that case the fitness function itself is entropy-based.

3 Entropy in Evolutionary Algorithms

Entropy is a concept in thermodynamics, information theory, and statistical mechanics. The thermodynamic entropy S is a measure of the amount of energy in a physical system that cannot be used to do work. As such, it is also a measure of the disorder and randomness present in a system. The entropy depends only on the current state of the system and not on its history; it is therefore a state function of the parameters (e.g., pressure and temperature) which describe the observable macroscopic properties of the system. The macroscopic state of the system is defined by a distribution on the microstates that are accessible to the system in the course of its thermal fluctuations. The entropy S of the system is defined as:

$S = -k_B \sum_i p_i \ln p_i$     (1)

where $k_B$ is a physical constant known as Boltzmann's constant, the sum runs over the microstates i, and $p_i$ is the probability that microstate i occurs during the system's fluctuations.

The basic concept of entropy in information theory has to do with how much randomness there is in a signal or random event. Shannon [16] defines the entropy of a discrete random event x with possible states 1..n as:

$H(x) = \sum_{i=1}^{n} p_i \log_2\frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log_2 p_i$     (2)

Statistical mechanics explains entropy as the amount of uncertainty which remains about a system after its observable macroscopic properties have been taken into account. For a given set of macroscopic quantities, such as temperature and volume, entropy is a function of the probability that the system is in various quantum states. The more states available to the system with higher probability, the greater the disorder and thus the greater the entropy. If the system has only one possible state, there is no uncertainty and the entropy of the system is zero. If the system has n possible states which are all equiprobable ($p_i = \frac{1}{n}$), the entropy is highest:

$H = -n \cdot \frac{1}{n} \log_2\frac{1}{n} = \log_2 n$     (3)

Figure 1: The relationship between entropy and the numbers and sizes of classes (panels (a)-(c) plot entropy against the number of classes)

Figure 1 shows how the numbers and sizes of the classes of a population affect entropy. High entropy in EAs reveals the presence of many unique fitness values, with the population evenly distributed over those values, as shown in Figure 1 (a). Figure 1 (c) represents low entropy, computed from a population which contains fewer unique fitness values because many individuals have the same fitness.

Rosca [14] calculates entropy for a population by first placing fitness values into fitness classes and counting the size of each fitness class. With p_i denoting the proportion of the population occupied by partition i, entropy is then defined as:

$E = -\sum_i p_i \log_2 p_i$     (4)

This paper extends [14] by experimenting with different, flexible definitions of fitness classes to facilitate an explicit balance between exploration and exploitation. As such, entropy also represents a succinct measure of diversity. Biological diversity refers to the differences between individuals in a population, which in nature imply structural (genotype) and behavioral (phenotype) differences. In EAs, identical genotypes produce the same fitness; thus, a decrease in genotype diversity will necessarily cause a decrease in phenotype diversity. Hence, to measure entropy/diversity, one needs to define some structural measures. Such measures, however, might be computationally intensive in some instances of EAs (e.g., genetic programming) [3]. Fortunately, based on the described entropy/diversity relationship between genotype and phenotype, measures at the phenotype level are sufficient.
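To make the phenotype-level measure concrete, the following short Python sketch computes the entropy of Equation (4) from class labels that have already been assigned to the individuals; the function name and the example labels are illustrative assumptions, not code from the paper.

import math
from collections import Counter

def population_entropy(class_labels):
    """Entropy E = -sum_i p_i * log2(p_i) over fitness classes (Eq. 4).

    class_labels: one fitness-class label per individual; how labels are
    produced depends on the chosen fitness-class definition (see Section 3).
    """
    n = len(class_labels)
    counts = Counter(class_labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: six individuals spread over three classes vs. all in one class.
print(population_entropy(["A", "A", "B", "B", "C", "C"]))  # ~1.585 (high diversity)
print(population_entropy(["A"] * 6))                       # 0.0   (no diversity)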
Figure 2: Fitness classes of linear entropy
Figure 3: Fitness classes of Gaussian entropy
Figure 4: Fitness classes of fitness proportional entropy

Figures 2, 3, and 4 show three new cases for defining fitness classes (a short sketch of the first two follows at the end of this section):

- Linear: Assign a predefined yet changeable value to the number of fitness classes, n. For each generation, the interval between the best and worst fitness values is evenly partitioned into n sub-intervals as fitness classes (Figure 2). An individual whose fitness value falls into a specific sub-interval is classified into the corresponding fitness class. The concept of linear fitness classes is adapted from [14]. The changeable n and the varying upper and lower bounds in each generation (i.e., the best and worst fitness values) are the two key differences between our approach and Rosca's.

- Gaussian: The partition of fitness classes in this case is derived from the Gaussian distribution, as shown in Figure 3. For each generation, fitness classes are "spread out" from the average fitness value (average) with the standard deviation (σ). For example, the upper/lower bound of the first fitness class (P1 in Figure 3) is computed as average ± σ. The boundaries of the successive classes (P2-P5) can be generalized as average ± i·σ, where i ∈ Z+ and i ≤ n/2. For each generation, the lower bound of the leftmost fitness class is less than or equal to the smallest fitness value, and the upper bound of the rightmost fitness class is larger than or equal to the biggest fitness value.

- Fitness proportional: The fitness proportional approach is a variation of Rosca's approach [14]. Rosca's fitness classes are partitioned by individuals having the same phenotypes, and p_i is the proportion of the population occupied by the i-th partition. In our approach, p_i is formalized as $p_i = f_i / \sum_{j=1}^{Popsize} f_j$, where f_i is the fitness value of an individual, and p_i is the criterion for categorizing fitness classes. If all individuals of a population have different p_i (namely, different fitness values), the number of fitness classes n equals the population size (Popsize). If more than one individual has the same fitness value (i.e., p_i = p_j, where i ≠ j), the term p_j · log2 p_j is eliminated from Equation (4) and n < Popsize, because two identical fitness classes are not needed and the elimination complies with the definition of diversity. Figure 4 shows 15 fitness classes sorted by p_i, each of which is occupied by one or more individuals.

The next section exercises the linear, Gaussian, fitness proportional and Rosca entropies in the entropy-driven approach and compares the experimental results with the Fogarty [7], Schaffer [15], and 1/5 success rule [12] approaches.
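As an illustration of the class definitions above, the following Python sketch assigns individuals to linear and Gaussian fitness classes. It is a minimal reading of Section 3: function names, tie-breaking details, and the folding of symmetric Gaussian classes into one index are assumptions made for the example.

import statistics

def linear_classes(fitness, n):
    """Split the [worst, best] fitness range into n equal sub-intervals (linear entropy)."""
    lo, hi = min(fitness), max(fitness)
    width = (hi - lo) / n or 1.0            # avoid zero width when all fitnesses are equal
    return [min(int((f - lo) / width), n - 1) for f in fitness]

def gaussian_classes(fitness, n):
    """Class boundaries spread out from the mean by multiples of the std. dev. (Gaussian entropy)."""
    mu = statistics.mean(fitness)
    sigma = statistics.pstdev(fitness) or 1.0
    labels = []
    for f in fitness:
        i = int(abs(f - mu) / sigma)         # how many sigmas away from the mean
        labels.append(min(i, n // 2))        # clamp to the outermost class
    return labels

fits = [3.2, 3.1, 2.8, 5.0, 4.9, 0.5]
print(linear_classes(fits, 5))
print(gaussian_classes(fits, 5))

The resulting labels can be fed directly into the population_entropy sketch given earlier.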
4 Experiments

Entropy-driven exploration and exploitation have been experimented with on the suite of test functions presented in [19]. Due to lack of space, only examples of the Sphere Model (f1), the generalized Rastrigin function (f9), and the Branin function (f17) are presented in this section. To illustrate all the experiments easily, the Best fitness value (B), Average fitness value (A), Worst fitness value (W), Population Diversity (D), Standard Deviation (S), Linear Entropy (E), Gaussian Entropy (G), Fitness Proportional Entropy (P), and Rosca Entropy (R) of a population from generation 0 to the maximum generation (X-axis) are included in the following figures.

Curves B, A, and W use the same definitions as in all other EAs; curves E, G, and P are defined in Section 3; curve S is the standard deviation of the fitness values of all individuals; curve D is the Euclidean distance between all individuals; and curve R is the entropy defined in [14]. All but the entropy curves (E, G, P, and R) use the left Y-axis as the coordinate.

Table 1 shows the initial values (we used the same settings as in [19]) for the following experiments: f1, f9, and f17 have different maximum generation (Maxgen) settings; Popsize is the population size; pm and pc are the mutation and crossover rates; Epoch is the stride of parameter adjustments during the evolutionary process; and Round is the number of experiments for each example, where the experimental results in the subsequent figures are the average values over 50 rounds. Sections 4.1, 4.2 and 4.3 respectively present f1, f9, and f17 with their experimental results for the Fogarty [7], Schaffer [15], 1/5 success rule [12], and entropy-driven approaches. Only two figures per function are included in the paper. All of the experimental results with the corresponding figures may be found at the PPCea website [13].

Parameter      Value   Parameter      Value
Maxgen (f1)    1500    Maxgen (f9)    5000
Maxgen (f17)   100     Popsize        100
pm             0.005   pc             0.75
Epoch          50      Round          50

Table 1: Initial values of parameters in the experiments on functions f1, f9, and f17

Figure 5: 1/5 success rule approach for f1

4.1 The Sphere Model

The Sphere Model (f1) is a unimodal function as shown in Equation (5):

$f_1(x) = \sum_{i=1}^{d} x_i^2$     (5)

where x_i ∈ [-100, 100], d (dimension) = 30, and min(f1) = f1(0,...,0) = 0.

The first presented experiment is the parameter tuning approach using the Schaffer parameter setting (pm = 0.005 and pc = 0.75). The mean best value and convergence rate¹ are 6.82 · 10⁻⁸ and 830, respectively. The Fogarty approach is a deterministic one that initializes pm = 0.11375 and adjusts the value using the Fogarty formula [7]. Its mean best value is 2.13 · 10⁻⁵ at generation 765.

¹ The point at which curve Best becomes flat in the figure.

Figure 5 presents the results using the 1/5 success rule [12]. Such a rule determines pm to be increased when the success rate of mutation is greater than 1/5, and to be decreased when it is less than 1/5. In Figure 5, a good balance between exploration and exploitation (yet still leaning towards exploration) can be found before generation 900: curves E and R are stable in the ranges between 1.4 and 1.65 and between 1.55 and 2.00, respectively; curves B, A, W, S, and D decrease smoothly; and pm is changed every 50 generations to adjust the mutation step. From generations 900 to 1220, curves E and R decline steeply and curve G moves downhill. Such curves show that the evolutionary process shifts from exploring to exploiting the current regions with relatively small mutation steps. From generations 1220 to 1320, all entropy curves flatten out and curve D has a "saw-toothed" shape. Such curves imply that the searching process is in the exploitation phase and is not stuck in local optima. The best value found using the 1/5 success rule approach is 6.82 · 10⁻⁸ at generation 1274.
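The 1/5 success rule used above can be sketched in a few lines of Python; the adaptation factor and the success-rate bookkeeping below are illustrative assumptions, since the text only states that pm is raised when the success rate exceeds 1/5 and lowered when it falls below 1/5.

def one_fifth_rule(pm, success_rate, factor=1.22):
    """Adapt the mutation rate pm according to the 1/5 success rule.

    success_rate: fraction of mutations in the last epoch that produced an
    improvement. factor > 1 is an assumed adaptation constant.
    """
    if success_rate > 0.2:
        return pm * factor        # many successes: push towards exploration
    if success_rate < 0.2:
        return pm / factor        # few successes: push towards exploitation
    return pm                     # exactly 1/5: leave pm unchanged

pm = 0.005
for rate in (0.4, 0.1, 0.2):
    pm = one_fifth_rule(pm, rate)
    print(round(pm, 6))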
Before examining the last chart, an entropy-driven approach written in PPCea is shown in Figure 6. When entropy is greater than 0.5, pm is decreased to facilitate the exploitation phase; smaller mutation steps avoid an increase of population diversity. When entropy is smaller than 0.5, more exploration is required to avoid local optima, and therefore pm is increased to diversify the search regions. This example shows that the balance between exploration and exploitation can be adjusted by the synergy of entropy and pm.

 1  genetic
 2    g := 0;
 3    while ( g < Round ) do
 4      t := 0;
 5      init;
 6      while ( t < Maxgen ) do
 7        callGA;
 8        if ( Entropy > 0.5 ) then
 9          pm := pm * 0.82
10        fi;
11        if ( Entropy < 0.5 ) then
12          pm := pm * 1.22
13        fi;
14        t := t + Epoch
15      end;
16      g := g + 1
17    end
18  end genetic

Figure 6: Entropy-driven parameter control written in PPCea

Figure 7 shows the result of using this source code. In Figure 7, curves E, P, and R steeply decline between generations 0 and 450. Curves B, A, W, S, and D also diagonally traverse the plane. When curve E is between its midpoint (at generation 350) and its upper bound (0.74 to 1.68), pm is decreased (line 9 of the PPCea code) to balance exploitation against exploration. When curve E is between its lower bound and its midpoint (0 to 0.74), exploration outweighs exploitation through the raised pm. This phenomenon can be observed from curve D, which declines more steeply and has a drastic "saw-toothed" shape from generations 400 to 500. Curve R is similar to curve E in terms of its shape and the balance between exploration and exploitation. The best value found is the same as with the 1/5 success rule; however, note that convergence is much faster in the entropy-driven approach (at generation 467). Hence, many fitness evaluations after generation 467 can be skipped.

Figure 7: Entropy-driven approach for f1

4.2 Generalized Rastrigin's Function

Generalized Rastrigin's function (f9) is a multimodal function with many local minima, as shown in Equation (6):

$f_9(x) = \sum_{i=1}^{d} [x_i^2 - 10\cos(2\pi x_i) + 10]$     (6)

where x_i ∈ [-5.12, 5.12], d (dimension) = 30, and min(f9) = f9(0,...,0) = 0.

For the Schaffer approach on f9 (figure in [13]), exploration is still carried out energetically after generation 1000 in a search space comprising many local minima. Because of this late, vivid exploration, the best solution is still improved slightly (20.86) until the evolutionary process converges at generation 3988. Figure 8 (the experimental results of the Fogarty approach) is a good example of the process getting stuck in local minima. The figure shows that pm may decrease too fast to perform enough exploration. After generation 400, the entropy curves (E, G, P, and R) and the fitness curves (B, A, and W) are nearly static, yet the diversity curves (D and S) exhibit extreme shakiness. This phenomenon implies that even though exploration is still active, the relatively small pm does not provide enough exploration power to help the evolutionary process jump out of the local optima. Hence, the experimental results of the Fogarty approach are the worst among the four approaches (40.55 at generation 4079).

Figure 8: Fogarty approach for f9
Figure 9: Entropy-driven approach for f9

The characteristic of exploiting many local minima can also be examined in the results of the 1/5 success rule (figure in [13]). However, because of the insufficient exploration power determined by the small pm value at the later stage, no exploration or exploitation activity is observed.
The mean best value and convergence rate are 25.24 and 1265, respectively.

Figure 9 shows a good case of balance between exploration and exploitation using the entropy-driven approach on a multimodal function. In the chart, the evolutionary process starts by shifting from more exploration toward more exploitation, driven by a declining pm, before generation 320. From generations 300 to 850, the rising pm facilitates more exploration to discover many local minima. In this phase, equal or better values may be found and selected, and the entropy and diversity curves are therefore updated drastically. Most importantly, because of the exploration of the search space of local minima, the fitness value is still slightly improved (23.99) at a very late phase (generation 3023).

4.3 Branin Function

The Branin function (f17) is a multimodal function with only a few local minima, as shown in the following equation:

$f_{17}(x) = \left(x_2 - \frac{5.1}{4\pi^2}x_1^2 + \frac{5}{\pi}x_1 - 6\right)^2 + 10\left(1 - \frac{1}{8\pi}\right)\cos x_1 + 10$     (7)

where x1 ∈ [-5, 10], x2 ∈ [0, 15], d (dimension) = 2, and min(f17) = 0.398.

Because there are only a few local minima in f17 and the maximum generation number is small, the evolutionary process cannot be guaranteed to discover all of the local optima using the Schaffer approach (i.e., the parameter tuning problem). In Figure 10, the diversity curves reappearing after generations 30, 66, 76, 83 and 90 show that a few local optima are found in this phase. Fortunately, the evolutionary process still possesses enough exploration power to improve the mean best value (0.421 at generation 90). Similar to the Schaffer results, the Fogarty approach for f17 also produces small refinements of the mean best value (0.432) at a late stage (generation 80). However, the slightly different results between the two approaches may stem from the earlier decrease of pm in the Fogarty approach. Please refer to [13] for the enlarged Figure 10 and the numerical improvement of the mean best value that may not be visible in Figure 10. For the 1/5 success rule, because the mutation success ratio is always below the ideal value of 0.2, the entire process inclines towards exploitation by reducing pm. The mean best value (0.434) is close to that of the Fogarty approach. However, because of the different formulae for adjusting pm, the 1/5 success rule converges at generation 59, which is much earlier than the Fogarty approach. Although f17 has only a few local minima, the entropy-driven approach still maintains a good balance between exploration and exploitation and finds even better solutions at the end of the evolutionary process. Figure 11 presents similar characteristics (i.e., rising pm, drastically changing entropy curves, and decreasing fitness value curves) as Figure 10. The mean best value is 0.398 at generation 100.

Figure 10: Schaffer approach for f17
Figure 11: Entropy-driven approach for f17

The experimental results on all benchmark functions indicate that the linear and Rosca approaches for defining fitness classes are superior to the Gaussian and fitness proportional ones in terms of providing the information needed for balancing exploration and exploitation.
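For reference, the three benchmark functions used in Section 4 can be written directly from Equations (5)-(7); the short Python sketch below follows those definitions (with the standard Branin constants) and is not code from the paper.

import math

def f1_sphere(x):
    """Sphere model, Eq. (5): sum of squares, minimum 0 at the origin."""
    return sum(xi ** 2 for xi in x)

def f9_rastrigin(x):
    """Generalized Rastrigin, Eq. (6): many local minima, minimum 0 at the origin."""
    return sum(xi ** 2 - 10 * math.cos(2 * math.pi * xi) + 10 for xi in x)

def f17_branin(x1, x2):
    """Branin function, Eq. (7): a few local minima, minimum value about 0.398."""
    return ((x2 - 5.1 * x1 ** 2 / (4 * math.pi ** 2) + 5 * x1 / math.pi - 6) ** 2
            + 10 * (1 - 1 / (8 * math.pi)) * math.cos(x1) + 10)

print(f1_sphere([0.0] * 30), f9_rastrigin([0.0] * 30))   # both 0.0
print(round(f17_branin(math.pi, 2.275), 3))              # approximately 0.398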
5 Conclusion and Future Work

Opinions on the research on exploration and exploitation are still widely divided [5]. In this paper, we introduce a novel entropy-driven exploration and exploitation approach. The balance between exploration and exploitation is achieved on-line by the synergy of pm, pc and entropy. The on-line adaptation mechanism uses PPCea to decide whether more exploitation or exploration is needed, depending on the current progress of the algorithm and on the current estimated potential of discovering better solutions. The experimental results in all figures show that our approach can easily interpret the influence of exploration and exploitation using curve E and the auxiliary curves. Experiments with the entropy-driven exploration and exploitation approach for evolution strategies [12] are planned. Additionally, a more generic PPCea that accommodates more related work (e.g., Harik's parameter-less genetic algorithm [9]) will benefit the evolutionary computation community.

References

[1] T. Bäck. Selective Pressure in Evolutionary Algorithms: A Characterization of Selection Mechanisms. Proc. 1st IEEE Conf. on Evolutionary Computation, pages 57-62, 1994.
[2] T. Bäck, D.B. Fogel, and Z. Michalewicz. Handbook of Evolutionary Computation. Oxford University Press, 1996.
[3] E. Burke, S. Gustafson, G. Kendall, and N. Krasnogor. Advanced Population Diversity Measures in Genetic Programming. Parallel Problem Solving from Nature - PPSN VII, Springer-Verlag LNCS, No. 2439, pages 341-350, 2002.
[4] K. De Jong. The Analysis of the Behavior of a Class of Genetic Adaptive Systems. Ph.D. thesis, Department of Computer Science, University of Michigan, Ann Arbor, Michigan, 1975.
[5] A. Eiben, R. Hinterding, and Z. Michalewicz. Parameter Control in Evolutionary Algorithms. IEEE Trans. on Evolutionary Computation, Vol. 3, No. 2, pages 124-141, 1999.
[6] A. Eiben and C. Schippers. On Evolutionary Exploration and Exploitation. Fundamenta Informaticae, No. 35, pages 35-50, 1998.
[7] T.C. Fogarty. Varying the Probability of Mutation in the Genetic Algorithm. Proc. 3rd Intl. Conf. on Genetic Algorithms, pages 104-109, 1989.
[8] J.J. Grefenstette. Optimization of Control Parameters for Genetic Algorithms. IEEE Trans. on Systems, Man & Cybernetics, SMC-16, No. 1, pages 122-128, 1986.
[9] G. Harik and F. Lobo. A Parameter-less Genetic Algorithm. Technical Report IlliGAL 9900, Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign, 1999.
[10] S.-H. Liu, M. Mernik, and B.R. Bryant. Parameter Control in Evolutionary Algorithms by Domain-Specific Scripting Language PPCea. Proc. Intl. Conf. on Bioinspired Optimization Methods and their Applications, pages 41-50, 2004.
[11] W. Lu and I. Traoré. A New Evolutionary Algorithm for Determining the Optimal Number of Clusters. Proc. 17th Intl. Conf. on Tools with Artificial Intelligence, pages 712-713, 2005.
[12] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. 3rd ed., Springer-Verlag, 1996.
[13] PPCea: A Domain-Specific Language for Evolutionary Algorithms. http://www.cis.uab.edu/liush/PPCea.htm
[14] J. Rosca. Entropy-Driven Adaptive Representation. Proc. of the Workshop on Genetic Programming: From Theory to Real-World Applications, pages 23-32, 1995.
[15] J.D. Schaffer et al. A Study of Control Parameters Affecting Online Performance of Genetic Algorithms for Function Optimization. Proc. 3rd Intl. Conf. on Genetic Algorithms, pages 51-60, 1989.
[16] C. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27, pages 379-423, 623-656, 1948.
[17] R. Ursem. Diversity-Guided Evolutionary Algorithms. Parallel Problem Solving from Nature - PPSN VII, Springer-Verlag LNCS, No. 2439, pages 462-471, 2002.
[18] S. Whitehead. Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge University, England, 1992.
[19] X. Yao, Y. Liu, and G. Lin. Evolutionary Programming Made Faster. IEEE Trans. on Evolutionary Computation, Vol. 3, No. 2, pages 82-102, 1999.

Stopping Criteria for a Constrained Single-Objective Particle Swarm Optimization Algorithm

Karin Zielinski and Rainer Laur
Institute for Electromagnetic Theory and Microelectronics, University of Bremen, Germany
{zielinski,rlaur}@item.uni-bremen.de

Keywords: constrained single-objective optimization, evolutionary algorithms, particle swarm optimization, stopping criteria

Received: November 3, 2006

When using optimization algorithms the goal is usually clear: the global optimum should be found. However, in general it is not clear when this goal is achieved, especially if real-world problems are optimized for which no knowledge about the global optimum is available. Therefore, it is not easy to decide when the execution of an optimization algorithm should be terminated. Although different mechanisms can be used for the detection of an appropriate time for ending an optimization run, only two of them are frequently used in the literature. Unfortunately, both methods have disadvantages, particularly for the optimization of real-world problems. Because it is especially important for practical applications when an optimization algorithm is terminated, as they usually contain computationally expensive objective functions, the performance of several stopping criteria that react adaptively to the state of an optimization run is evaluated for a Particle Swarm Optimization algorithm in this work. The examination is done on the basis of a constrained single-objective power allocation problem. Suggestions from former work concerning stopping criteria for unconstrained optimization are verified and comparisons with results for Differential Evolution are made.

Povzetek: Ovrednoteni so ustavitveni kriteriji za optimiranje z roji delcev (angl. particle swarm optimization) in rezultati primerjani z rezultati algoritma diferencialne evolucije.

1 Introduction

Evolutionary algorithms (EAs) are a class of population-based stochastic optimization algorithms that incorporate mechanisms from evolution into optimization processes. The most famous representatives of this class are possibly Genetic Algorithms [5], but in recent years also e.g. Particle Swarm Optimization (PSO) [7] and Differential Evolution (DE) [9] have had a lot of success.

For theoretical aspects of evolutionary algorithms stopping criteria are usually not important. However, for practical applications the choice of stopping criteria can significantly influence the duration of an optimization run. Depending on the stopping criteria, an optimization run might be terminated before the population has converged, or computational resources might be wasted because the optimization run is terminated late. Real-world problems mostly contain computationally expensive objective functions that may result in optimization runs that take several days, so wasting computational resources has to be prevented.

In the literature mostly two stopping criteria are applied in single-objective optimization: either an error measure with respect to the known optimum is used, or the number of function evaluations is limited to fe_max. These criteria are perfectly suitable for e.g. comparing the performance of different algorithms, but for solving real-world problems there are some drawbacks. The first method has the disadvantage that the optimum has to be known, so it is generally not applicable to real-world problems because the optimum is usually not known a priori. The second method is highly dependent on the objective function. Because generally no correlation can be seen between an optimization problem and the required number of function evaluations, fe_max usually has to be determined by trial-and-error methods. Evolutionary algorithms include randomness in the optimization process, so the number of function evaluations needed for convergence is subject to fluctuations and a safety margin for fe_max is needed. The fluctuations can be significant, as can be seen in [17] where a test suite of 24 functions has been examined and the standard deviation of the number of function evaluations for reaching a predefined error measure was up to 180,000. If a real-world problem with an unknown optimum resulted in a similar standard deviation, it would be difficult to choose fe_max.

As a result, it would be better to use stopping criteria that consider knowledge from the state of the optimization run. The time of termination would then be determined adaptively, so function evaluations could be saved.

Several stopping criteria that are sensitive to the state of the optimization run are reviewed in [19] and [20]; they observe the improvement, movement or distribution of the population members. In [19] stopping criteria are tested for unconstrained single-objective optimization using Particle Swarm Optimization and Differential Evolution, while in [20] the criteria have been adapted for constrained single-objective problems using DE, because real-world problems normally include constraints.
In this work it will be examined whether the suggestions regarding stopping criteria for PSO from [19] hold for the constrained real-world problem of optimizing a power allocation scheme. Additionally, a comparison with the results for DE in [20] will be done.

This work is organized as follows: In Section 2 related work is discussed. In Section 3 the Particle Swarm Optimization algorithm is described, and Section 4 provides a short introduction to Differential Evolution. In Section 5 the stopping criteria that are used in this work are reviewed. In Section 6 results are shown, and Section 7 closes with conclusions.

2 Related Work

Every optimization algorithm includes a stopping rule, but there are only few works concentrating explicitly on stopping criteria. In [16] convergence of a Particle Swarm Optimization algorithm is detected by computing a maximum swarm radius, by doing a cluster analysis, or by calculating the rate of change of the objective function. Most stopping criteria are applicable not only to PSO but also to other population-based optimization algorithms; e.g. in [1] the difference between the maximum and minimum objective function value is used as a stopping criterion for a Differential Evolution algorithm. In [13] termination criteria are discussed not only for evolutionary algorithms but also for other optimization algorithms. Often, criteria similar to the ones used in this work are also applied in hybrid algorithms to determine the moment when global search should be replaced by local search [4, 6, 15].

3 Particle Swarm Optimization

Particle Swarm Optimization is derived from the behavior of social groups like bird flocks or fish swarms.
Although the "survival of the fittest" principle is not used in PSO, it is usually considered an evolutionary algorithm. A thorough discussion of this topic can be found in [7]. As in this work, PSO is mostly used for the optimization of continuous functions. Optimization is achieved by giving each individual in the search space a memory of its previous successes, information about the successes of a social group, and a way to incorporate this knowledge into its movement. Therefore, each individual (called a particle) is characterized by its position x_i, its velocity v_i, its personal best position p_i and its neighborhood best position p_g. Several neighborhood topologies have been developed [10]. In this work the von Neumann topology is used, as it showed promising results in the literature, e.g. in [8]. The dynamic behavior of PSO is generated by the update equations for the velocity and position of the particles:

$v_i(t+1) = w \cdot v_i(t) + c_1 r_1 [p_i(t) - x_i(t)] + c_2 r_2 [p_g(t) - x_i(t)]$     (1)

$x_i(t+1) = x_i(t) + v_i(t+1)$     (2)

Due to these equations the particles are drawn towards their personal best position and their neighborhood best position, and furthermore the velocity of the previous iteration is kept, weighted with the inertia weight w. Other parameters are c1 and c2, which influence the impact of the cognitive and social component, respectively. To add a stochastic element to the movement of the particles, the numbers r1 and r2 are chosen randomly from the interval [0,1] in each iteration. Further parameters of PSO are the population size NP and the maximum velocity Vmax, which is used to prevent oscillations with increasing magnitude [7]. The control parameter settings for this examination are derived from a parameter study using the same optimization problem (yet unpublished): w = 0.6, c1 = 0.4, c2 = 1.4, NP = 64, Vmax = 11 (Xmax - Xmin).

Constraint handling is done by modifying the replacement procedure for personal and neighborhood best positions [11]. In unconstrained single-objective optimization a personal or neighborhood best position is replaced if the current position has a lower objective function value (for minimization problems, as in this work). For constrained single-objective optimization this rule is altered so that in a comparison of two solutions a and b, a is preferred if

- both vectors are feasible and a has the better objective function value, or
- both vectors are infeasible and a has the lower sum of constraint violations, or
- a is feasible and b is infeasible,

where feasibility means that all constraints are satisfied. In contrast to several other constraint-handling techniques, no additional parameters are needed for this method [2]. For unconstrained problems the modified PSO algorithm is exactly the same as the original PSO.
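A compact Python sketch of the velocity/position update in Equations (1)-(2) and of the feasibility-based comparison rule is given below; the data layout, random-number handling and helper names are illustrative assumptions rather than the authors' implementation.

import random

def update_particle(x, v, p_best, g_best, w=0.6, c1=0.4, c2=1.4, v_max=None):
    """One PSO step per Eqs. (1)-(2) for a single particle (lists of floats)."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        r1, r2 = random.random(), random.random()
        vel = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        if v_max is not None:                       # clamp to prevent growing oscillations
            vel = max(-v_max, min(v_max, vel))
        new_v.append(vel)
        new_x.append(xi + vel)
    return new_x, new_v

def better(a, b):
    """Feasibility-based comparison: True if solution a is preferred over b.

    Each solution is a tuple (objective_value, constraint_violation_sum),
    where a violation sum of 0 means the solution is feasible.
    """
    fa, va = a
    fb, vb = b
    if va == 0 and vb == 0:
        return fa < fb            # both feasible: compare objective values
    if va > 0 and vb > 0:
        return va < vb            # both infeasible: compare violation sums
    return va == 0                # a feasible solution beats an infeasible one

x, v = update_particle([0.0, 0.0], [0.1, -0.1], [1.0, 1.0], [2.0, -1.0])
print(x, v, better((3.2, 0.0), (1.5, 0.7)))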
4 Differential Evolution

The main characteristic of Differential Evolution is an adaptive scaling of step sizes that results in fast convergence behavior. Using DE, the population members are evolved from one generation to the next by applying the operators mutation, recombination and selection. The first two operators generate new vectors by linearly combining several population members and afterwards exchanging some vector components. The third operator decides, based on objective function values and constraint violation, which vectors will be kept for the next generation. Because no deterioration with regard to the objective function value is possible, the DE selection scheme is called greedy [14]. More specific information about the DE algorithm used here can be found in [20].

5 Stopping Criteria

Stopping criteria are needed to terminate the execution of optimization algorithms. In contrast to using a maximum number of function evaluations as a stopping condition, other criteria have the advantage of reacting adaptively to the state of the optimization run, so function evaluations can be saved. Unfortunately, it seems to be impossible to define a stopping criterion without introducing one or more parameters. The parameter settings generally depend on the given optimization problem. However, it should be investigated whether there are stopping criteria for which the parameter settings are robust to changes, or whether parameters can be set depending on certain aspects of the problem. It is assumed that the general behavior of different optimization problems with respect to stopping criteria is similar. It should be kept in mind that limiting the number of function evaluations as a stopping criterion also incorporates the choice of a problem-dependent parameter fe_max. Hence, it is favorable to examine other possibilities for stopping that react adaptively to the state of the optimization run.

In the following, the stopping criteria that incorporate information about the state of the optimization run are reviewed shortly. Note that there is a change compared to [19]: instead of using the current positions x_i for the calculation of the stopping conditions, the personal best positions p_i are used here. The reason is that the current positions fluctuate strongly, whereas the development of the personal best positions is smoother, so decisions about the termination of an optimization run should be easier.

Improvement-based criteria terminate an optimization run if only small improvement is made, because usually large improvements are achieved at the beginning of an optimization run whereas the improvement becomes small in later stages. Three different conditions are used here (a small sketch of the improvement check follows the movement-based criteria below):

- ImpBest: The improvement of the best objective function value is monitored. If it falls below a given threshold t for a number of generations g, the optimization run is terminated.
- ImpAv: Similar to ImpBest, but instead of observing the best objective function value, the average value computed from the whole population is checked.
- NoAcc: It is observed whether any new p_i are accepted in a specified number of generations g. For DE this criterion is slightly different because in DE there are no personal best positions (instead, the acceptance of new population members is considered).

For movement-based criteria not the improvement but the movement of individuals is regarded. Two variants of movement-based criteria are considered that differ in the regarded space:

- MovObj: The movement of the individuals with respect to their objective function value (objective space) is checked against a threshold t for a number of generations g. MovObj is different from ImpAv only if the regarded algorithm allows deterioration of the individuals' objective function value. This is the case for PSO in contrast to DE, but as p_i are considered here instead of x_i, MovObj = ImpAv holds in this case also. Therefore, this criterion is not regarded further in this work.
- MovPar: The movement with respect to positions (parameter space) is checked against a threshold t for a number of generations g.
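As referenced above, the window-based improvement check behind ImpBest and ImpAv can be sketched in a few lines of Python; the class name, counter handling and the synthetic usage values are assumptions made for illustration only.

class ImprovementStop:
    """Stop when the per-generation improvement of a monitored value (best or
    average objective function value) stays below threshold t for g generations."""

    def __init__(self, t, g):
        self.t = t
        self.g = g
        self.prev = None
        self.count = 0            # consecutive low-improvement generations

    def should_stop(self, value):
        if self.prev is not None and abs(self.prev - value) < self.t:
            self.count += 1
        else:
            self.count = 0
        self.prev = value
        return self.count >= self.g

stop = ImprovementStop(t=1e-5, g=10)
for gen in range(1000):
    best = 1.0 / (gen + 1)        # stand-in for the best objective value of generation gen
    if stop.should_stop(best):
        print("terminated at generation", gen)
        break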
The distribution-based criteria consider the diversity in the population. If the diversity is low, the individuals are close to each other, so it is assumed that convergence has been obtained.

- StdDev: It is checked whether the standard deviation of the positions is below a threshold m.
- MaxDist: The distance from every population member to the best individual is observed. The optimization run is stopped if the maximum distance is below a threshold m.
- MaxDistQuick: MaxDistQuick is a generalization of MaxDist (a short sketch follows at the end of this section). Instead of using the whole population for the computation of the maximum distance to the best population member, a quicksort algorithm is employed to sort the individuals according to their objective function value, and only the best p% of the individuals are regarded. The background for this criterion is that there are optimization runs where most of the population has converged to the optimum, but because of the remaining individuals which are still searching, the optimization run is not stopped although they do not contribute any new information. Using MaxDistQuick an optimization run can be stopped earlier than using MaxDist, so wasting of computational resources is avoided. However, the percentage p must not be set too low for a reliable detection of convergence.
- Diff: The difference between the best and worst objective function value is checked against a threshold d. A further demand is that at least p% of the individuals are feasible, because otherwise Diff could lead to undesired results if, e.g., only two individuals are feasible and they happen to be close to each other. In contrast to the previous three criteria, which operate in parameter space, Diff considers objective space.

Because functions have different features, it may be beneficial to couple several criteria. Up to now two combined criteria have been regarded:

- ComCrit: This criterion is a combination of ImpAv and MaxDist. Only if the condition of ImpAv is fulfilled is MaxDist checked.
- Diff_MaxDistQuick: Diff is a criterion that is rather easy to check, but it fails on flat surfaces. Therefore, if its condition has been fulfilled, the MaxDistQuick criterion is checked afterwards.
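The following Python fragment sketches the MaxDistQuick check referenced above; sorting is done with Python's built-in sort rather than an explicit quicksort, and the data layout is an assumption made for the example.

import math

def max_dist_quick(positions, objectives, p, m):
    """Return True if the best fraction p (0 < p <= 1) of the population lies
    within distance m of the best individual (stopping condition fulfilled)."""
    order = sorted(range(len(positions)), key=lambda i: objectives[i])
    k = max(2, int(p * len(positions)))          # always compare at least one pair
    best = positions[order[0]]
    max_dist = max(
        (math.dist(best, positions[i]) for i in order[1:k]),
        default=0.0,
    )
    return max_dist < m

pop = [[0.0, 0.0], [0.01, 0.02], [0.02, 0.01], [3.0, 4.0]]
obj = [0.0, 0.1, 0.2, 9.0]
print(max_dist_quick(pop, obj, p=0.75, m=0.1))   # True: the best three particles are close
print(max_dist_quick(pop, obj, p=1.0, m=0.1))    # False: the outlier is included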
6 Results

As a basis for the examination a real-world problem was used that consists of optimizing a power allocation scheme for a Code Division Multiple Access (CDMA) system [20]. The overall power is minimized, considering the powers of 16 individual users as parameters. Because multiple users send data simultaneously in a CDMA system, multi-user interference degrades the system performance. The application of a parallel interference cancellation technique allows estimation of the multi-user interference, so it can be subtracted from the received signal before detection, resulting in an improvement of the system performance. The convergence of the parallel interference cancellation technique has to be incorporated in the optimization problem as a constraint.

In the following, results are shown sorted according to the type of stopping criterion. The global optimum is considered to be reached if an objective function value of f(x) < 18.5 has been found [20]. As performance measures the convergence rate and the success performance (mean number of function evaluations weighted with the total number of runs divided by the number of successful runs) are given. A high convergence rate and a small success performance indicate good performance. To allow easy comparisons, figures showing success performances are scaled to 20,000.

A maximum number of generations Gmax = 1000 is used to terminate the algorithm if the examined stopping criteria do not lead to termination in appropriate time. If a run is not stopped before Gmax is reached, the run is considered unsuccessful.

6.1 Improvement- and Movement-Based Criteria

Because ImpAv, ImpBest and MovPar rely on similar mechanisms, the convergence rate and success performance of these criteria are displayed together. Considering the convergence rate, almost no dependence on the number of generations g is observable (Figure 1(a)). For decreasing values of the improvement threshold t the convergence rate generally increases, except for MovPar, which was not able to terminate several runs before reaching Gmax for small settings of t.

The success performance of ImpAv, ImpBest and MovPar increases slightly with growing g (see Figure 1(b)). The results regarding t are similar for ImpAv and ImpBest: for high settings of t the success performance is large because of the small convergence rate. After a strong decrease, the success performance increases again for smaller values of t because of the growing average number of function evaluations for convergence. The smallest success performance of MovPar is in the same range as for ImpAv and ImpBest. The difference in the average number of function evaluations for different settings of t is larger for MovPar than for ImpAv or ImpBest, so the success performance grows quickly for decreasing t. As a result, the success performance is better for t = 10⁻² than for t = 10⁻⁴, although the convergence rate of t = 10⁻² is worse.

The success performance of ImpAv and MovPar has similar characteristics as for DE in [20]. For ImpBest the results are different: the success performance for g = 5 is considerably better for PSO. Furthermore, the success performance is dependent on t and almost independent of g, whereas for DE it depends more on g than on t. The reason for the different results is not clear yet. The results for ImpAv and ImpBest are considerably better here than in [19] for unconstrained single-objective problems. For ImpAv the reason might be that the personal best positions are regarded here instead of the current positions, but criterion ImpBest did not change because only the global best result is regarded. In contrast, for MovPar the results are worse, but it has to be kept in mind that the results are slightly dependent on the setting of Gmax because it influences the convergence rate.

Unfortunately, suitable parameter settings for ImpAv and ImpBest cannot be derived from knowledge about the optimization problem. Besides, it is indicated in [19] that problems arise for functions with a flat surface, but it is usually not known in advance if a function possesses this property. Therefore, it will be necessary to do examinations on parameter settings for the application of these stopping criteria. Based on the examined problem, parameter settings of g ≈ 10 ... 15 and t ≈ 10⁻⁵ ... 10⁻⁴ are recommended. However, these settings are dependent on the optimization problem and the desired accuracy. It has to be noted also that these criteria may not be as reliable as others because improvement often occurs irregularly in evolutionary algorithms.

Criterion NoAcc showed good results for DE in [20], but not a single run could be terminated before reaching Gmax for PSO. Apparently, the personal best positions improve too often to allow a stopping criterion like NoAcc.
6.2 Distribution-Based Criteria

For MaxDist the convergence rate does not get above 80% because of runs that could not be terminated before reaching Gmax. The results for StdDev are shifted in contrast to MaxDist, and higher convergence rates are reached (Figure 2(a)). Furthermore, StdDev yields a lower minimum success performance than MaxDist (Figure 2(b)). For both criteria the performance is highly dependent on the setting of m. However, it is connected to the desired accuracy of the result. Similar effects have been found in [20] for DE. The same settings of parameter m yield the lowest success performances for MaxDist and StdDev for PSO as for DE, respectively.

Figure 1: Results for criteria ImpAv, ImpBest and MovPar — (a) convergence rate, (b) success performance
Figure 2: Results for criteria MaxDist and StdDev — (a) convergence rate, (b) success performance

The convergence rate and success performance of MaxDistQuick are given for 10⁻³ ≤ m ≤ 10⁻¹ in Figures 3(a) and 3(b). Other parameter settings are omitted because the success performance was above 20,000. The convergence rate fluctuates for m = 0.1 with different settings of p, indicating that the performance is not robust for this parameter setting. For m = {10⁻², 10⁻³} and varying p the convergence rate is approximately constant, but the success performance rises with increasing p. Hence, a similar result is obtained as in [19]: because fewer function evaluations are needed for convergence if smaller values of p are used and the convergence probability is not compromised, it is recommended to use e.g. 0.3 ≤ p ≤ 0.5. In spite of the increased computational effort for the incorporated quicksort algorithm [18], MaxDistQuick is considered to be superior to MaxDist and StdDev for PSO, particularly because the increased computational effort is usually negligible when compared to the computationally expensive objective function evaluations of real-world problems. For future work, a similar generalized criterion based on the standard deviation instead of the maximum distance should also be evaluated. For DE the success performance depends less on p [19, 20], so MaxDistQuick does not have advantages over MaxDist for DE. This behavior is supposed to be connected with the greediness of the DE selection scheme.

Figure 3: Results for criterion MaxDistQuick — (a) convergence rate, (b) success performance

It may be confusing that the success performance for MaxDistQuick with p = 1 is not equal to the results of MaxDist. The reason is that the success performance is sensitive to even small changes in the number of successful runs.
Like parameter m of the other distribution-based criteria, the setting of d is connected with the desired accuracy of the result. The highest convergence rate is achieved with d = 10-2 but although d = 10-1 results in a worse convergence rate, the success performance is better. Criterion Diff is advantageous in contrast to the distribution-based criteria in parameter space if several parameter combinations yield the same objective function value. In this case the distribution-based criteria in parameter space may waste computational resources while the algorithm tries to converge to one point in the search space, with no or only little improvement of the objective function value. However, Diff is likely to produce bad results for functions with a flat surface [19]. 6.3 Combined Criteria The convergence rate and success performance for both combined criteria are given for m > 10-2 because smaller values of m lead to success performances larger than 20,000 (Figures 5(a), 5(b), 6(a) and 6(b)). The results are different than for DE as the success performance increases less with decreasing value of m. Especially for Diff_MaxDistQuick the results are almost independent from m. However, a strong dependence on d can be seen, in particular for the success performance. For the combined criteria generally more parameters have to be set than for the individual criteria and furthermore the dependence of parameter settings on the desired accuracy of the results cannot be seen anymore, so in general it might be easier to use the individual criteria. 6.4 Summary Although the improvement-based criteria ImpAv and ImpBest yielded good results in this work, they are considered as rather unreliable because generally improvement occurs irregularly in evolutionary algorithms. To prevent early termination, parameter g must not be chosen too low when using these criteria. The movement-based criterion MovPar has similar problems. The third improvement-based criterion NoAcc was not able to stop a single optimization run during the given maximum number of generations, so it is classified as unsuitable for PSO although it showed a good performance for DE in [20]. Based on the considered optimization problem as well as results from [19] it can be concluded that it is beneficial to use the generalization MaxDistQuick instead of MaxDist. Because StdDev performed better than MaxDist, a generalization of StdDev should be examined in future work. In general the distribution-based criteria in parameter space are classified as reliable means for detecting convergence. The distribution-based criterion in objective space (Diff) is also considered to be a reliable criterion with the exception of optimization problems that contain objective functions with a flat surface. As the combined criteria are combinations of other criteria, they generally incorporate more parameters that have to be adjusted. So far no advantage of combined criteria could be found that would compensate this drawback, so it is recommended to use the individual criteria. 100 90 80 70 60 SS 50 ^40 <5 30 20 10 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 P (a) Convergence rate 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 P (b) Success performance Figure 4: Results for criterion Diff 80- 60- E? 40 ^^ 20 0 10"' = 1 100- m = 0.1 = 0.01 x 104 ;5 20s 1.5. CO Si (/) w (D Si 0.5. 
Figure 5: Results for criterion ComCrit — (a) convergence rate, (b) success performance
Figure 6: Results for criterion Diff_MaxDistQuick — (a) convergence rate, (b) success performance

7 Conclusions

In this work stopping criteria were studied that react adaptively to the state of an optimization run based on the improvement, movement or distribution of individuals. In contrast to other examinations, not the current positions but the personal best positions were used for the calculations. It was shown that the stopping criteria can be used for constrained problems using PSO. A similar behavior as for DE could be found for several stopping criteria. It would be interesting to make comparisons with other evolutionary algorithms in future work. Although parameter settings have to be determined in dependence on the optimization problem used, general statements could be derived. It was not possible to determine one criterion that is best for all problems, but because of their adaptive nature, generally improved performance for real-world problems is expected in contrast to termination after a limited number of function evaluations.

For multi-objective optimization the definition of appropriate stopping criteria is even more important, because real-world problems usually contain multiple objectives. It will also be even more challenging, because usually the population will not converge to one point in the search space but to the Pareto-optimal front, so using error measures is difficult. One possibility would be to monitor performance measures like hypervolume [3] and calculate e.g. their improvement. Another approach from the literature is based on observing the development of crowding distances [12]. As only little work has been done in this area so far, it is an interesting field of research for future work.