Informatica 31 (2007) 1-13 1 A Novel Roll-Back Mechanism for Performance Enhancement of Asynchronous Checkpointing and Recovery Bidyut Gupta and Shahram Rahimi Department of Computer Science, Southern Illinois University Mail Code 4511, Carbondale, IL 62901-4511, USA {bidyut, rahimi}@cs.siu.edu Yixin Yang Department of Biological Sciences, Emporia State University Emporia, KS 66801, USA yyang@emporia.edu Keywords: asynchronous checkpointing, recovery, maximum consistent state Received: May 26, 2006 In this paper, we present a high performance recovery algorithm for distributed systems in which checkpoints are taken asynchronously. It offers fast determination of the recent consistent global checkpoint (maximum consistent state) of a distributed system after the system recovers from a failure. The main feature of the proposed recovery algorithm is that it avoids to a good extent unnecessary comparisons of checkpoints while testing for their mutual consistency. The algorithm is executed simultaneously by all participating processes, which ensures its fast execution. Moreover, we have presented an enhancement of the proposed recovery idea to put a limit on the dynamically growing lengths of the data structures used. It further reduces the number of comparisons necessary to determine a recent consistent state and thereby reducing further the time of completion of the recovery algorithm. Finally, it is shown that the proposed algorithm offers better performance compared to some related existing works that use asynchronous checkpointing. Povzetek: Opisan je izboljšan postopek okrevanja v porazdeljenih sistemih. 1 Introduction Checkpointing and rollback-recovery are well-known techniques for providing fault-tolerance in distributed systems [1]-[5]. The failures are basically transient in nature such as hardware error [1]. Typically, in distributed systems, all the sites save their local states, known as local checkpoints. All the local checkpoints, one from each site, collectively form a global checkpoint. A global checkpoint is consistent if no message is sent after a checkpoint of the set and received before another checkpoint of the set [2]-[4], that is, each message recorded as received in a checkpoint should also be recorded as sent in another checkpoint. In this context, it may be mentioned that a message is called an orphan message if it is recorded as received in a checkpoint, but not recorded as sent in another checkpoint. The local checkpoints belonging to a consistent global checkpoint will be termed in the present work as globally consistent checkpoints (GCCs). After recovery from a failure processes in a distributed computation restart their computation from a consistent global checkpoint /state (CGS) of the system, i.e. from their respective GCCs. It may be noted that a consistent global checkpoint of a system is termed as a recent or a maximum one if, after the system recovers from a failure, the number of events (states) rolled back at each processor is a minimum [6]. To determine consistent global checkpoints, two fundamental approaches have been reported in the literature [1]-[9]. These are synchronous and asynchronous approaches. In the synchronous approach, processes involved coordinate their local checkpoint actions such that the set of all recent checkpoints in the system is guaranteed to be consistent. Although it simplifies recovery it has the following disadvantages: (1) additional messages need to be exchanged by the checkpointing algorithm when it takes each checkpoint; (2) synchronization delay is introduced during normal operation [5]. In the asynchronous approach, processes take checkpoints independently without any synchronization among them. Therefore, it is the simplest form of taking checkpoints. However, because of the absence of synchronization there is no guarantee that a set of local checkpoints taken will be a consistent set of checkpoints. That is, there may exist orphan messages between the local checkpoints. In order to get rid of the orphan messages while determining the GCCs, processes have to rollback. In such a situation, rolling back one process causes one or more other processes to roll back. This effect is known as the domino effect [5]. This is the main drawback of the asynchronous approach. So, a recovery algorithm has to search for the most recent consistent set of checkpoints before the system restarts its normal operation. Therefore, the recovery process is 2 Informatica 31 (2007) 1-13 B. Gupta et al. quite complex while the checkpointing scheme is much simpler compared to the same in synchronous approach. 2 Related Works In this work, we have considered asynchronous checkpointing approach because of its simplicity in taking checkpoints. So, in this section we state briefly the contributions of some noted related works. When processes take checkpoints independently, some or all of the checkpoints taken may be useless for the purpose of constructing consistent global checkpoints. A set of checkpoints can belong to the same consistent global snapshot if no zigzag path (Z-path) exists from a checkpoint to any other checkpoint [15]. In other words, absence of a Z-path means absence of any orphan message. A theoretical framework for characterizing quasi-synchronous algorithms has been presented in [12]. Quasi-synchronous checkpointing algorithms reduce the number of useless checkpoints by preventing the formation of noncausal Z-paths between checkpoints and advance recovery line. "Advancement of recovery line" is interpreted as follows: the more the recovery line is advanced, the less is the amount of computation to be redone by processes after the system of processes restart their normal operation; meaning thereby the reduction in the amount of rollback per process after the system recovers from failure. Depending on the degree to which the non causal Z-paths are prevented, quasi-synchronous checkpointing algorithms are classified into three classes namely [12], Strictly Z-Path Free (SZPF), Z-Path Free (ZPF), and Z-Cycle Free (ZCF). Manivannan and Singhal [13] have presented a quasi-synchronous checkpointing algorithm which allows the processes to take checkpoints asynchronously and reduces the number of useless checkpoints by forcing processes to take additional checkpoints. In this checkpointing algorithm, each process maintains a counter which is periodically incremented and the time period is same in all the processes. When a process takes a checkpoint, it assigns the current value of its counter as the sequence number for the checkpoint. Each message is equipped (i.e. piggybacked) with the sequence number of the current checkpoint. If the sequence number accompanying the message is greater than the sequence number of the current checkpoint of the process receiving the message, then the receiving process takes a checkpoint and assigns the sequence number received in the message as the sequence number to the new checkpoint and then processes the message. Quasi-synchronous checkpointing algorithm makes sure that none of the checkpoints taken lies on a Z-cycle in order to make all checkpoints useful. Asynchronous recovery algorithms are also presented in this paper based on the checkpointing algorithm. A failed process needs to roll back to its latest checkpoint and requests other processes to rollback to their consistent (latest) checkpoints. The work claims to be free from any domino effect. However, arguably this work is more of a synchronous approach than an asynchronous approach; partly because all processes have identical time periods to take checkpoints, and checkpoint sequence numbers are used so that all the ith checkpoints of all processes are taken at the same time (i.e., logically at same time). Hence, we argue that there is no question of domino effect as this work is not at all an asynchronous approach. Gupta et al. [11] have proposed a hybrid roll forward checkpointing/recovery approach. Processes take checkpoints periodically and these time periods are different for different processes. Periodically, in absence of any failure, an initiator process invokes the algorithm to advance the recovery line; the duration of this period is assumed to be much larger than the time period of any individual process. Therefore, the domino effect is limited by this time period. The main advantages of this work are that each process may need to keep at most two checkpoints at any time, processes participate in the algorithm simultaneously ensuring re-execution time after a failure is limited by the period of execution of the algorithm, and finally, recovery is as simple as in the synchronous checkpointing/recovery approach. Ohara et al. [14] proposed an uncoordinated checkpointing algorithm for finding a recovery line where a given checkpoint is the earliest. In this algorithm, each process maintains a set of all local checkpoints on that process in a local vector. All local checkpoints which are just behind a given checkpoint are initially assumed to form a consistent global checkpoint. The algorithm checks happened-before relation for any coupled local checkpoints belonging to an ordered global checkpoint set. If there exists any happened-before relation, it replaces a local checkpoint with a successive local checkpoint of the same process. The algorithm may end by either finding a recovery line or running out of local checkpoints to be replaced. Venkatesan and Juang [16] presented an asynchronous checkpointing algorithm where each process take checkpoints independently and keeps track of the number of messages it has sent to other processes as well as the number of messages it has received from other processes. The algorithm is initiated by the process which fails and is recovered from thereafter or when it learns about process failure. During its each iteration, a process needs to compare the number of messages received by it and the actual number of messages sent by the other process, at each of its checkpoint starting from the most recent one. The received vectors corresponding to all the checkpoints including the current one and the one where next iteration starts, need to be fetched from the storage in order to decide the checkpoint for the next iteration to start with. 3 System Model The distributed system has the following characteristics [1], [6], [10]: 1. Processes do not share memory and they communicate via messages sent through channels. 2. Channels are made virtually lossless and order of the messages is preserved by some end-to-end transmission protocol. A NOVELL ROLL-BACK MECHANISM FOR... Informatica 31 (2007) 1-13 3 3. When a process fails, all other processes are notified of the failure in finite time. We also assume that no further processor (process) failures occur during the execution of the algorithm. In fact, the algorithm must be restarted if there are further failures. Below we state the problem considered in this work. Problem Formulation: In this work, we have considered asynchronous checkpointing approach because of its simplicity in taking checkpoints. That is, processes take checkpoints periodically and each process determines independently its time period of taking its checkpoints. So, different processes may have different time periods for taking their checkpoints. After the system recovers from a failure, processes start from the recent consistent state of the system. However, the main drawback of this approach is that determining a consistent global checkpoint may involve a very large number of pairwise comparisons of checkpoints belonging to different processes because of the presence of a possible domino effect. In absence of any hybrid approach [11], in the worst case, all checkpoints of all processes may have to be compared. However, asynchronous checkpointing approach is suitable for highly reliable systems where failures occur very seldom. In this work, our objective is to design an efficient recovery algorithm that will reduce considerably the number of unnecessary pairwise comparisons of checkpoints while determining a consistent global checkpoint. In other words, our objective is to identify a priori the checkpoints that can not be the GCCs so that we can exclude these checkpoints from comparison resulting in a fast determination of a recent consistent global checkpoint (state) of the system. Note that an initial version of this work has appeared in [17]. 4 Data Structures Let us assume that the distributed system under consideration consists of n processes. Each process Pi maintains a vectors V! of length n. The V! vector records the number of messages process Pi has sent to every other process with the exception that the element v^ (=Vi (i)), i.e. the number of messages process Pi has sent to itself will be always zero. The Vi vector is described below: Vi = [vi,0, vu,.....,v1,1,.....,v1,n-i] where vi,j = Vi (j) and represents the number of messages sent by process Pi to process Pj, and vi,i is always zero. All entries in Vi are initialized to zero. Each time process Pi decides to send a message m to process Pj, then Vi(j) is incremented by one. This facilitates process Pi to know how many messages it has sent to process Pj. In this work, Cj,r represents the rth checkpoint taken by process Pj. Sometimes when mentioning the checkpoint number is irrelevant, we simply use Cj to denote a checkpoint taken by Pj. Each process Pi also maintains a linear list Ri of dynamically growing length. At any given time t, the length of the list Ri (i.e. the number of the entries in the list) is equal to the number of checkpoints taken by Pi till time t. For example, the length of the list is 3 at the 3rd checkpoint of process Pi where as its length will be 4 at its 4th checkpoint and so on. The list Ri is described as Ri = [ ri1, ..... ,rir, ......,], where rir = Ri (r) and represents the number of messages received by process Pi from all other processes till its rth checkpoint. Each such list is initially empty. Each process stores its vectors and the lists together with the corresponding checkpoints in stable storage. Also copies of the lists and the vectors are stored in the respective local memories of the processors running the processes. It offers their faster access than to access them from stable storage whenever possible. In addition, each process maintains a Boolean flag. This flag is used to convey some specific information (described later). 5 Observations Consider the system of three processes P1, P2, and P3 as shown in Fig. 1. The vectors Vu V2, and V3 initially have all their entries set to zero. The lists Ri, R2, and R3 are initially all empty. By the time process P1 takes its first checkpoint C1,1, it did not send any message to P2 or P3. So its V1 vector is [000]. Also, process P1 received one message before it took its first checkpoint; so now the list R1 has one entry, i.e. R1 = [1]. By the time process P1 takes its second checkpoint C1, 2, it has already sent one message to P2. So it increments V1(2) by 1 and the vector V1 is now = [010]. Also, process P1 has not received any messages (from P2 or from P3) before it takes its second checkpoint. So the list R1 at C12 is [1,1]. In the same way, the vector and the list are updated at each checkpoint of each process. This example will be used later in this paper to illustrate the working principle of our proposed algorithm. We assume that a process Pi after recovery from its failure acts as the initiator process, i.e., Pi is responsible for invoking the recovery algorithm. To start with Pi sends a message requesting all Pj, 0 < j < n-1, j 4 i, to send to it their respective Vj vectors corresponding to their latest checkpoints. Upon receiving the request, every process Pj sends its Vj to Pi. After receiving the vector Vj from all processes the initiator process Pi forms a two dimensional array VN 0 Vijt o *n-]J0 v^-U It is written below. Vo.n-j vlhn-l 0 where the j row represents Vj, 0 < j < n -1. The initiator process then computes the column sums to create the following vector: Vc = [vc0, vc1, ..., vcj, ... ,vcn-1] where vcj = column sum of the entries of the jth column of VN and is given as 4 Informatica 31 (2007) 1-13 B. Gupta et al. VcJ = Vc(j) = IVn (i , j), for i = 1 to n. Therefore, vcj represents the total number of messages sent to process Pj by all other processes as recorded in each sending process' latest checkpoint. The initiator process Pi then unicasts vcj (= Vc(j)) to process Pj. After receiving vcj from P;, each process Pj computes Dj = Rj(r) - vcj, assuming that the last checkpoint of process Pj is the rth checkpoint (Cj,r). The difference Dj (if >0) gives the exact number of orphan messages received by a process Pj till its checkpoint cj,r, from all other processes in the system. Initiator process Pi also does similar computation to determine the exact number of orphan messages (if any) it has received till its latest checkpoint Cir. Proof of this statement is given later. Vi V, V, V, R R, R,= R,=1,1 I44I4I x wi zxx R V3 Figure 1: Vectors (Vi) and lists (Ri) for i = 1, 2, and 3 Observe that for every process Pj, vcj and Rj(r) may not be identical, because some of the sent messages (recorded already by the sending processes at their respective latest checkpoints) may not have arrived yet at Pj (i.e. vcj > Rj(r)), or some of the received messages (by Pj) may not have been recorded at the latest checkpoints of some sending processes because these messages may have been sent after their latest checkpoints (i.e. vcj < Rj(r)). Assume that the last checkpoint of process Pj is the rth checkpoint (Cj,r) and Dj is greater than zero (Dj >0). Search in the list Rj is performed backwards, starting with its last component. Thus, we search the proceeding entries of the list Rj from Rj (r) till the first Rj (m) so that Rj (r) - Rj (m) > Dj , (m < r). Then, the checkpoints Cj,r, ..., Cj, m+1 are excluded from the consideration of GCC composition, i.e. these checkpoints will be skipped. So, now we start from the checkpoint Cjm of process Pj. The vector Vj at checkpoint Cj m along with the Boolean flag "1" are sent to the initiator process Pi for the computation of the next iteration. In the next iteration, if Dj is smaller than or equal to zero (Dj < 0), which means that process Pj has not received any orphan message till the checkpoint Cjr. process Pj will send the flag "0" to the initiator process Pi . The initiator process Pi will use the vector Vj at Cj r for the computation of the next iteration. Initiator process Pi is also involved in similar computation like any other process Pj to determine its appropriate vector Vi needed for the computation of the next iteration. This will be repeated until all processes send "0" flags to the initiator process Pi and Pi's own flag is also 0 . Then the initiator process Pi will notify all processes to rollback to their respective latest checkpoints at which their corresponding flags have the value 0 each. Thus, this set of checkpoints is a globally consistent checkpoint (proof is given later). The following observations are necessary for designing the recovery algorithm. Lemma 1: Let Cjr be the latest checkpoint of process Pj at time t. If Dj > 0, then process Pj has received a total Dj number of orphan messages from other processes. Proof: Rj(r) represents the total number of messages process Pj has received so far from all other processes till time t. Also vcj represents the total number of messages sent by all other processes to Pj as recorded in their latest checkpoints. Therefore Dj > 0 means that at least some process Pi (i 4 j) has sent some message(s) to Pj after taking its latest checkpoints. It also means that the sending processes have not yet been able to record these Dj messages. Since all such Dj messages have been received and recorded in Pj's latest checkpoint, but remain unrecorded by the sending processes, therefore Pj has received Dj number of orphan messages from the rest of the processes with respect to the checkpoint Cjr. ■ Lemma 2: If Dj < 0, process Pj has not received any orphan message. Proof: Dj = 0 means that the number of messages received by Pj is equal to the number of messages sent to Pj and these sent (also received) messages have already been recorded by the sending processes in their latest checkpoints. Therefore the received messages can not be orphan. Also, Dj < 0 means that the number of the messages received by Pj is less than the number of messages sent to it. Now vcj is the total number of messages sent by all other processes to Pj as recorded in the latest checkpoints of the sending processes. It means that all messages received by Pj have already been recorded by the senders. Hence none of such received messages can be an orphan. Hence the proof follows. ■ Lemma 3: Let Dj > 0 at the checkpoint Cjr of process Pj and let m denote the largest integer that satisfies Rj(r) -Rj(m) > Dj (m < r). Then none of the checkpoints Cjr, Cjr-1, ..., Cj, m+1 belongs to the set of the globally consistent checkpoints. Proof: Because m is the largest integer that satisfies Rj(r) - Rj(m) > Dj (m < r), the relation Rj(r) - Rj(i) < Dj is established for any i ( m+1 < i < r). Moreover, according to Lemma 1, Pj has received exactly Dj number of orphan messages from all other processes. So there must be at least one orphan message received by process Pj before Cjr, and the same also is true before every checkpoint between Cjr and Cjm. Hence, none of the checkpoints Cjr, m+1 can belong to the set of the globally consistent checkpoints. ■ V Fa A NOVELL ROLL-BACK MECHANISM FOR... Informatica 31 (2007) 1-13 5 Theorem 1: Given a set S = {Cjr} of n checkpoints, one from each Pj, 0 < j < n -1, if for every checkpoint Cj r, its corresponding Dj < 0, then S* is the set of the globally consistent checkpoints. Proof: Since Dj < 0, for each process Pj, ( 0 < j < n-1) at its checkpoint Cjr e S*, therefore, all received messages by any such process Pj have already been recorded as sent by the sending processes in their corresponding checkpoints. Hence, according to Lemma 2 none of the messages received by process Pj is an orphan message. This is true for all processes. Therefore, the system of n processes does not have any orphan messages with respect to the checkpoints of the set S*. Hence the set S* is the set of globally consistent checkpoints. ■ Before we present the algorithm formally, we give an illustration of its working principle using the example of Fig. 1. An illustration: Suppose a failure 'f occurs on the processor running the process P1. The process P1 that became faulty, acts as the initiator after recovery from failure. After the system recovers from the failure, to start with, initiator process P1 broadcasts a request asking the other two processes P2 and P3 to send their respective vectors V2 and V3 corresponding to their latest checkpoints C21, and C3,2. In this example, the three latest checkpoints of processes P1, P2, and P3 before the failure occurs are C15, C21, and C3,2. The respective vectors V1, V2, and V3 at the three latest checkpoints are [010], [100] and [020]. After receiving all these vectors, P1 (it becomes the initiator after recovery from failure) forms a two dimensional array VN. It is written below: 0 1 0 VN = 1 0 0 0 2 0 P1 creates the vector VC = [130] and unicasts vcj to each process Pj, for j = 1, 2, and 3. After receiving vcj from Pj each process Pj computes Dj ( = Rj(r) - vcj) (assuming the last checkpoint of Pj is the rth checkpoint) to determine the total number of orphan messages (if any) it has received with respect to its latest checkpoint and also Pi does the same. The lists R1, R2, and R3 at the latest checkpoints (C1>5, C2>1, and C3,2) of processes Pi, P2 and P3 are [1,1,2,4,5], [2] and [0,1] respectively. P1 finds that D1 = (5-1) = 4; so it has received 4 orphan messages. It calculates the difference between R1(5) and R1(2) and finds that R1(5) - R1(2) = 4 = D1; so process P1 now considers the vector V1 (= [010]) at C12 along with a flag "1" for the computation of the next iteration. P2 finds that it has not received any orphan message because D2 = (2-3) < 0. So it sends the same vector [100] and a flag "0" to P1. Process P3 finds that D3 = (1-0) = 1; so it has received an orphan message. It calculates the difference between R3(2) and R3(1) and finds that R3(2) - R3(1) = 1 = D3; so process P3 now sends the vector V3 (= [010]) at C3,1 along with a flag "1" to P1 for the computation of the next iteration. In the second iteration, P1 forms the following two dimensional array. Vn = 0 1 0 1 0 0 0 1 0 P1 creates the vector VC = [120] and unicasts vcj to process Pj, for j = 1, 2, and 3. P1 finds that it has not received any orphan message because at C1,2, its D1 = 1 - 1 = 0. So, it sets its flag to 0. P2 also finds that it has not received any orphan message because at C2,1, its D2 = 2 - 2 = 0; and it sends the flag "0" to P1. Similarly, P3 finds that it has not received any orphan message because at C31, its D3 ( = R3(1) - vc3) = 0 - 0 = 0, and it sends a flag "0" to P1. Thus, P1 receives flag 0 from each process including its own flag set to 0. It then notifies each process to rollback to the current checkpoints corresponding to these flags (= 0). At this time, none of the processes needs to roll back further and hence P1 terminates the algorithm. Thus the algorithm terminates after two iterations. Therefore the GCCs belonging to the maximum consistent state are C1,2, C2,1 , and C3,1. It may be noted that in each iteration we need to fetch only the latest Rj for each process Pj and some Vj vectors (not all) to determine the GCCs. In each iteration, the checkpoints that can not be the GCCs are identified and their vectors Vj are not fetched at all. That is, the presented approach will not repeat its operation unnecessarily for these vectors corresponding to these non-GCCs. It definitely makes the approach fast and efficient. Observe what happens if we do not consider the above idea to determine the GCCs. It is stated below. First, C1,5, C2,1, and C3,2 are considered and compared pairwise to determine if they are globally consistent. Since C1,5 and C3,2 are not, so in the next iteration C1,4, C2,1, and C3,1 are considered pairwise. But C1,4 cannot be a GCC. Therefore C1,3, C2,1, and C3,1 are now considered. But since C1,3 can not be a GCC, therefore C1,2, C2,1, and C3,1 are now considered. This time it is found that these three checkpoints are globally consistent. Therefore four iterations for pairwise comparisons of three checkpoints, one from each process, are needed to determine the GCCs as opposed to only two when the approach presented in this work is followed. It also means that the number of trips to the stable storage for fetching checkpoints can also be reduced to a good extent in the proposed approach. It definitely makes our algorithm fast. Moreover when processes take large number of checkpoints before a failure occurs, our approach may offer even much better performance from the viewpoint of a possible large reduction in the number of iterations (i.e. the number of trips to stable storage as well) to determine the GCCs. As a result, the recovery scheme also will be faster. Besides, it is clear from the example that each process Pj simultaneously identifies the checkpoints that cannot be globally consistent and therefore these checkpoints should be skipped. This parallelism of the algorithm further enhances the speed of execution of the recovery approach. 6 Informatica 31 (2007) 1-13 B. Gupta et al. 6 Algorithm to Determine Globally Consistent Checkpoints In the following algorithm we assume that process Pi was faulty. So, it becomes the initiator of the recovery algorithm after it recovers from the failure. 6.1 Algorithm Recovery Input: Given the latest n checkpoints, one for each process Pj , 0 < j < n-1, for an n process system and the corresponding vectors Vj and lists Rj at these n checkpoints. Output: A set of globally consistent checkpoints (maximum consistent state of the system). The responsibilities of each participating process Pj and the initiator process Pj are stated in Fig. 2. Proof of Correctness: Each process Pj repeats its steps 1, 2, 3, and 4 to arrive at a checkpoint that has not recorded the receipt of any orphan message from the other processes (using the observations of Lemmas 1, 2, and 3). In other words, it identifies the checkpoints that can not belong to the set of the globally consistent checkpoints and skips them. This decision is taken by identifying a checkpoint Cj,m such that m is the largest integer that satisfies Rj(r) - Rj(m) > Dj (m < r). None of the checkpoints Cj,r, Cjr-1, ..., Cj, m+1 can belong to the set of the globally consistent checkpoints and they are skipped. However, the initiator process Pj decides when to terminate the algorithm, i.e., when the checkpoints can become globally consistent. Process Pj checks to see if all processes send flags of 0, i.e. Dj < 0 for each process Pj. If so, the algorithm terminates according to Theorem 1. Note that the condition Dj < 0 must always occur during the execution of the algorithm. It may be observed that in the worst case, because of some typical communication pattern, the domino effect may force each process to restart from its initial state where for each process Pj we always have Dj = 0. Besides, since the algorithm starts with the latest checkpoints, the number of events (states) rolled back at each processor is a minimum. This is true because, in its Step 4 each process Pj skips only the checkpoints that are non GCCs. Thus the algorithm determines the maximum consistent state of the system as well. ■ 6.2 Advantages of the proposed approach The presented algorithm offers the following advantages. During its each iteration, each process Pj determines the checkpoints that can not be the GCCs. Therefore, the algorithm is able to avoid any unnecessary computations of VC corresponding to these non GCCs. The presented algorithm skips checkpoints that do not belong to the set of the globally consistent checkpoints; thus it avoids many unnecessary pairwise comparisons. It also means that the number of trips to the stable storage for fetching checkpoints can also be reduced to a good extent in the proposed approach. It definitely makes the algorithm fast and efficient. The simultaneous execution of the algorithm by all participating processes also contributes to the speed of execution of the algorithm. Besides, the algorithm can find the maximum number of checkpoints to be skipped by determining the largest integer m, which satisfies Rj(r) - Rj(m) > Dj. This guarantees significant reduction in the iterations of computation. 6.3 Performance Message complexity: Suppose the termination of the algorithm requires the construction of the vector VC by the initiator process Pj to occur k times (i.e. k number of iterations). During each such time every process in the n-process system exchanges a couple of messages with the initiator process P;. Thus, O(n) messages are sufficient for each time. Thus, considering k times, the message complexity of the algorithm is O(kn). Besides message complexity, another factor that must be considered as a performance measure is the number of pairwise comparisons of the checkpoints among the processes that is needed to be performed by any asynchronous checkpointing/recovery approach. This is done in order to determine a consistent global state of the system. Obviously larger the number of such comparisons, larger is the execution time of the recovery algorithm. This has been discussed in the previous subsection. It may be noted that the number of such pairwise comparisons is also related to the number of times checkpoints are fetched from stable storage, i.e. the number of trips to the storage. The time spent on such trips may be substantial enough to affect to a good extent the speed of execution of any recovery algorithm. One possible solution may be to fetch a large number of checkpoints at a time. However, it may not be a good idea at all in many situations; for example, a process may end up in fetching too many when that many are not needed, or too little when more are needed. So, it becomes quite arbitrary about how many checkpoints should be fetched at a time. Therefore, it is wise to consider that a process will fetch one checkpoint at a time and in fact, this is true for all existing asynchronous checkpointing / recovery algorithms. In the following analysis we consider the fact that larger the number of pairwise comparisons of checkpoints, larger is the number of trips to stable storage, and therefore, larger is the execution time as a result. In our analysis we will not consider complexity due to message size, as most related works including ours use control messages of reasonably small size and all these works differ mainly in terms of the number of comparisons, number of iterations, and the number of control message needed to determine a consistent global state. It may be noted that computing this number of comparisons is not very straightforward because it depends solely on the nature of the distributed computations. However, we give an approximate analysis which may not be very accurate; still it will offer a clear understanding of the advantages of our algorithm A NOVELL ROLL-BACK MECHANISM FOR... Informatica 31 (2007) 1-13 7 liijtialioroyjs^: Step 1: It asks every process P, lo send its V, corresponding 0 [i searches the list Rj till ¡1 linds the largest integer in (< rj lhal satisfies R.,(rl - K,(m> > D,. Then it sets its Hag to I and considers V; corresponding to its checkpoint C; m li e. Cj>ris replaced by Cj„ ) for the next iteration;, f* Checkpoints C^ Qpj.....Cilxr/ are skipped V else 11 Kts its lliig. to 0 and considers Vj at C^ for the next iteration: Step Jt receives the flag ¡«id V, from each process Py; if flag = 0 for each pn>eess Pj, 0 < j < n-l P, asks cach process P, to restart the application program from its last checkpoint corresponding lo which D(; P, resets i|s vector V, to zero and list R, to an empty list corresponding to it* restart in ti chcckpoint at wbich D, < 0; It restarts Computation; I* its responsibility associated with the algorithm is finished */ * Giobatfy1 consistent checkpoints belonging, to (he maximum consistent sitae tire determined*/ else Control ilo^vs to S from P,; It computes D,hy calculating (R.j 0 It searches the list Rj till it finds the largest integer m (< r) tliat udsfi« K:tr) - R,(m)> Dj. Then it sends a flag of 1 and Vjto P, corresponding to its checkpoint C,, q, (i,e, C,,r is replaced by Cj^ f* Checkpoints CtJ. C;irtl; tire skipped */ else It sends a flag of 0 and Vjat Cj r to the initiator process P^ 8 Informatica 31 (2007) 1-13 B. Gupta et al. Figure 2: The responsibilities of each participating process Pj and the initiator process ?! over some other noted asynchronous checkpointing / recovery approaches [14], [16]. It is stated below. Let the system consist of n processes. For simplicity we assume that after a failure occurs and the system recovers from it, each process will skip on an average its latest (r-1) checkpoints to restart its computation. Thus a process Pj will skip its latest (r-1) checkpoints Cjm+2, .... , Cj,r+m. We also assume that the set {C0,m+1, C1>m+1, ...,Cn-i,m+i) represents the globally consistent checkpoint (maximum consistent state) of the system and our algorithm will determine it in k number of iterations. In this simple model, we consider a recovery approach associated with asynchronous checkpointing scheme in which the pairwise comparisons to determine checkpoints' consistency involves first the checkpoints of the set {C0m+r, ... , Cn-1m+r}, followed by the set {C0,m+r-1, ... , Cn-1jm+r-1}, ... and so on, and finally the set {C0m+1, .., Cn-1,m+1} which is the globally consistent state. Therefore, the total number of comparisons is given by [r x{n(n-1)}/2]. Note that this may not be the exact way to perform the comparisons in a particular case; still it offers a clear view of how complex it can be. In general, a checkpoint(s) in one set may also have to be compared with a checkpoint(s) in another set. On the other hand, not necessarily all checkpoints in a set may be needed to be pairwise compared. It depends on the nature of the distributed computations. So the actual number of comparisons may be larger or smaller than the number [r* {n(n-1)}/2]. Anyway, it is clear that this number is much larger than the total number of comparisons k*n, offered by our approach, where n is the number of parallel comparisons to test if Dj > 0 in each iteration and 1 < k < r. Observe that in the worst case, the number of comparisons of the proposed approach may become [r* {n(n-1)}/2]. Below we have compared the performance of our approach with the approaches in [14], [16]. 6.3.1 Comparison with Ohara et. al. [14] Ohara et al. [14] have proposed an asynchronous approach for finding a recovery line where a given checkpoint is the earliest. All the local checkpoints which are just behind a given checkpoint are initially assumed to form a consistent global checkpoint. In this algorithm, happened-before relations are checked for every coupled local checkpoints belonging to an ordered global checkpoint set. If there exists any happened-before relation, it replaces a local checkpoint with a successive local checkpoint of the same process. The algorithm may end by either finding a recovery line or running out of local checkpoints to be replaced. This leads to exhaustive comparisons of happened before relations for every coupled local checkpoints. The number of such comparisons is approximately [r* {n(n-1)}/2] as calculated earlier. In our algorithm, it skips the checkpoints that do not belong to the set of the globally consistent checkpoints. Thus, our algorithm reduces to a good extent unnecessary pairwise comparisons of the checkpoints to determine global consistent checkpoint of the system. Performance comparison of the above mentioned approach [14] and our approach is shown in Fig. 4. Fig. 3 illustrates how the number of comparisons is affected with the increase in the average number of checkpoints per process (r) in the asynchronous approach [14] and in our approach. Fig. 4 shows the variation of the number of comparisons with the increase in the number of processes (n). Both figures highlight the advantages offered by our approach, i.e. considerable amount of reduction in the number of comparisons in our approach. It helps the processes to restart their computation related to the distributed application much faster after the system recovers from a failure. 3000 M 2500 8 z 2000 w I linn 0 0 ■K 1000 ö 500 0 5 10 15 20 25 30 35 40 45 50 The Propoe ed Approach —x — Ohara ei al Approach Figure 3: Number of comparisons vs. the average number of checkpoints per process (r). 6.3.2 Comparison with Venkatesan et. al. [16] Venkatesan and Juang [16] presented an asynchronous checkpointing algorithm where each process takes checkpoints independently and keeps track of the number of messages it has sent to other processes as well as the number of messages it has received from other processes. The existence of orphan messages is discovered by comparing the number of messages sent and received. The algorithm is initiated by the process when a failure occurs or when it learns about process failure. 2500 s 5 2000 i z « 150U & c 1000 0 0 500 0 9 10 n 12 15 17 20 ■ The Proposed Approach —x — Ohara ei a L Approach Figure 4: Number of comparisons vs. the number of processes (n). A NOVELL ROLL-BACK MECHANISM FOR... Informatica 31 (2007) 1-13 9 During its each iteration, a process needs to compare the number of messages received by it and the actual number of messages sent by the other process, at each of its checkpoints starting from the recent one. The received vectors corresponding to all the checkpoints including the current one and the one where next iteration should start, need to be fetched from the storage in order to decide the checkpoint for the next iteration to start with. It means that the number of trips to the storage for fetching the information related to the received message (for the purpose of comparison) will be equal to the number of checkpoints starting from the current checkpoint all the way to the checkpoint where the next iteration should start. 3000 - u> 2500 --- a> ™ 2000 --- u> W JÏ 1500 ---- "5 1000 ---- ° 500 ----- 0 _„ __x—* t t T 0 10 20 30 40 50 60 n The Proposed Approach —x—Venkatesan and Juang Approach Figure 5: Number of control messages vs. the number of processes (n). In our algorithm, the decision about the checkpoint at which the next iteration should start is based on the R-vector at the recent checkpoint only. This algorithm skips checkpoints that do not belong to the set of the globally consistent checkpoints by examining this R-vector only. Therefore, in order to determine the checkpoint for the next iteration to start with, the number of trips to the storage is only one per iteration. This means that the total number of trips to complete the execution of our algorithm is reduced to a good extent compared to that in [16]. We now compare the two algorithms based on the number of control messages needed to execute the respective algorithms. In [16], in each iteration, for an n-process system n(n-1) messages are exchanged among the processes. Thus, O(n2) messages are exchanged in each iteration. In our algorithm, 3(n-1) messages are exchanged in each iteration. Thus, O(n) messages are sufficient in each iteration in our algorithm where n is the number of processes in the system. Fig. 5 shows the message complexity comparison of the two algorithms with the increase in the number of processes. This figure clearly shows the advantage offered by our algorithm over the one in [16]. 7 Further Enhancement We have seen that the linear list Rj maintained by a process Pj increases dynamically. If the application program has large execution time and there is seldom any failure during its execution, the length of the lists may become too large; thereby it may consume considerable amount of memory. To solve this problem, i.e. to keep the list from growing too much we will propose a simple solution in this section. The following operation is needed in the implementation of the idea. We define the subtraction operation on two vectors Vj of process Pj at its two checkpoints Cjm and Cjs with (s > m) as follows: Vj at Cj,s - Vj at Cj,m = [(j at Cj,s - j at C,,m), ... , (vj, n-1 at Cj,s - vj, n-1 at Cj,m)] = [(j at j - j at Cj,m)] for 0 < p < n-1 We now state the basic idea to keep the growing lengths of the lists in control. This idea has been used in designing the enhanced recovery algorithm stated later in this section. It may be noted that the recovery algorithm stated earlier does not consider the use of this idea. In absence of any failure an algorithm runs periodically (say the time period is T which should be much larger than the time period of any individual process) to put a limit on the length of the R-vector. The lengths of the lists (R-vectors) may then be limited by the number of checkpoints taken by the processes during the time interval (T) between two successive executions of the algorithm. Besides in doing so, this also advances the recovery line in the event that a recent recovery line exists other than the one found during the previous execution of the algorithm. In effect, the number of comparisons of the checkpoints to determine a recent consistent state may also drastically reduce since there is a possibility that the algorithm will consider in a particular run only the checkpoints which the processes take during the interval T. Therefore, this enhanced algorithm, in general, may take much less time to complete its execution compared to Algorithm Recovery. Also note that at the completion of the lth execution of the algorithm a process Pj will have in stable storage only its recent globally consistent checkpoint, say Cjm and any other checkpoint (s) it has taken thereafter and prior to the start of the lth periodic execution of the algorithm. In describing the following two rules for updating the lists Rj and the vectors Vj of a process Pj we have assumed that the latest globally consistent checkpoint of process Pj is Cjm as determined by the lth execution of the algorithm and it has taken (k-m) more checkpoints thereafter and prior to the start of the lth periodic execution of the algorithm. Rule 1: Updated Rj at Cj m = {} and updated Vj at Cj,m = [00.0] Rule 2: Updated Rj at Cj s for (m+1 < s < k) = [(Rj(m+1) - Rj(m)), ... , (Rj(s)- Rj(m))], and Updated Vj at each Cj,s = [(vj,p at Cj,s - vj>p at Cj,m)] for 0 < p < n-1 When we implement the above idea of reducing the lengths of the lists, either of the following two approaches can be adopted: Approach 1: When a failure occurs and the system recovers from the failure, the algorithm is run again in 10 Informatica 31 (2007) 1-13 B. Gupta et al. spite of its periodic execution, with the hope that a recent (maximum) consistent state may be found which is not identical to the one determined by its last periodical execution. In such a situation the time to complete the application will be less because of the advancement of the recovery line. On the other hand, if such a situation as mentioned above does not exist, the algorithm will output the same consistent state as determined in its last periodic execution. In this case, however, the application will take an additional amount of time equal to the execution time of the algorithm for its completion. Approach 2: After the system recovers from a failure all processes restart from their respective globally consistent checkpoints which have already been determined by the algorithm's last periodic execution prior to the occurrence of the failure. The recovery becomes as simple as that in a synchronous approach. However, since this approach does not look for the possible existence of a recent consistent state other than the already existing one, therefore the time to complete the application may increase. Observe that irrespective of which approach is followed, the next periodic execution of the algorithm will occur T time units after the system restarts. About when to apply a specific rule, Rules 1 and 2 will be implemented when the algorithm runs periodically in absence of any failure. Rule 1 is also implemented when determination of a consistent global state of the system is needed after the system recovers from a failure (Approach 1). In the following algorithm we have considered a combination of the two approaches. For the selection of an initiator process for running the algorithm periodically, we consider that each process ?! maintains a local CLK variable which is incremented at periodic time interval T. It also maintains a local counter denoted as counteri , initially set to 0 and is incremented by process Pj during its turn to initiate the recovery algorithm. Thus, a process on its own determines if it is its turn to initiate the execution of the algorithm. In this context, observe that the set of GCCs is unique and is independent of the initiator process. We state below how a process Pi does it before we formally state the algorithm: Selection of an initiator process: At each process Pj (0 < i < n-1): If CLKj = (i+(counteri*n))*T counteri= counteri+1; /*When its turn to initiate the recovery algorithm, i.e., Pj becomes the initiator*/ Algorithm Recovery - Enhanced: Input: Given the latest n checkpoints, one for each process Pj , 0 < j < n-1, for an n process system and the corresponding vectors Vj and lists Rj at these n checkpoints. Output: A set of globally consistent checkpoints (maximum consistent state of the system). The responsibilities of the initiator process Pi and each participating process Pj are stated in Fig. 6. An example: Consider the system as shown in Fig. 7. Ignore the presence of the failure 'f for the time being. Suppose that the periodic execution of algorithm starts immediately after processes P1 and P3 take their respective checkpoints C1,5 and C3,2. The algorithm determines the latest consistent global checkpoint of the system. It is (Cu, C^, Cy}. The two rules are applied to update the lists R1, R2, and R3, and the vectors V1, V2, and V3 at the checkpoints of processes P1, P2, and P3 starting from their respective latest globally consistent checkpoints, which are namely C1,2, C2,1, C3,1. The system with the updated lists and vectors is shown in Fig. 8. The checkpoints shown in Fig. 8 are the only ones saved in stable storage. Now assume that a failure 'f has occurred. Therefore the algorithm determines the consistent global checkpoint of the system, which is {C1>2, C2>1, C3>1} and applies only Rule 1 to reset the vectors to zero and to make the lists empty at the respective GCCs of the three processes. V V V V V R R R1 R1= R1=1, I 4 4 I i I X r X .,./7/ 7 ^fx R3 = 0, Figure 7: Before the execution of the algorithm The system in this situation is shown in Fig. 9. The three respective consistent checkpoints are the only ones saved in the stable storage at this time. Note that the consistent global state remains the same (see Figs. 8 and 9). This is the situation when time to complete the application program increases by an amount equal to the time to execute the recovery algorithm. This has been pointed out earlier in the description of Approach 1. However, this will not happen if only Approach 2 is followed for recovery. Fa R3 = 0, V3 A NOVELL ROLL-BACK MECHANISM FOR... Informatica 31 (2007) 1-13 11 V V V V R R R R1= 1, R , R3= 01, V3 V=,=000 =010 Figure 8: After the execution of the algorithm in absence of any failure R1= {}, R2 = {}, V1=000 R3= {}, V3 Figure 9: The system restarts from its consistent global state {Ci)2,C2,i,C3)i}after recovery. 7.1 Comparison with [11] and [13] Gupta et al. [11] have proposed a roll-forward hybrid checkpointing / recovery scheme using basic checkpoints. The direct dependency concept used in the communication-induced checkpointing scheme has been applied to basic checkpoints to design a simple algorithm to find a consistent global checkpoint. They have used the concept of forced checkpoints that ensures a small reexecution time after recovery from a failure. This scheme has the advantages of simple recovery as in synchronous approach and simple way to create checkpoints like in asynchronous approach. Our proposed approach (enhanced version) is not a hybrid approach. It runs periodically only to put a limit on the size of the R-vectors. This is the primary objective of the enhanced approach. In doing so it may come out with a recent recovery line that is different from the one found during the last execution of the algorithm. Thus, effectively as mentioned earlier, even though the proposed algorithm is not a hybrid one, still as in [ii] it may reduce drastically the number of comparisons needed to identify a recovery line, as well as it may limit the domino effect by the time period T, based on the message communication pattern among the processes. Our proposed approach is quite different from the work in [i3] in that in our approach processes take checkpoints completely independently based on their individual time periods that are different for different processes. In [i3], processes take checkpoints with the same time periods and they make sure that there is no orphan message between any two ith checkpoints of two processes. Therefore, it is more of a synchronous approach than an asynchronous approach, where as our approach is purely an asynchronous approach. 8 Conclusions In this paper we have presented an efficient recovery algorithm for distributed systems. Asynchronous checkpointing scheme has been considered because of its simplicity in taking checkpoints. The main feature of the recovery algorithm is that to determine a maximum consistent state, the algorithm in its each iteration does not need to compare all the vectors at all the checkpoints of the processes. In its each iteration the algorithm identifies and skips those checkpoints that can not belong to the set of the globally consistent checkpoints. It not only reduces the computational overhead to a good extent, but also the number of trips to the stable storage for fetching checkpoints is reduced compared to the works in [14] and [16], and as a result its execution becomes even faster. In this context, it may be noted that in any algorithm that uses asynchronous checkpointing, there is always some computational time wasted to create process checkpoints that later do not belong to CGS and this problem can not be avoided. This is true for our proposed algorithms as well. Besides, it is executed simultaneously by all participating processes while determining a maximum consistent state. It further contributes to its speed of execution. We have also proposed a simple enhanced asynchronous recovery scheme to control the dynamically growing length of the lists. In effect, the number of comparisons of the checkpoints to determine a recent consistent state may also drastically reduce and based on the communication pattern among the processes it may limit the domino effect by the time period T. Even though we do not apply any hybrid checkpointing scheme [11], still this approach offers the option to achieve a recovery scheme which is as simple as the approach proposed in [11]. In this context, it may be noted that if the system model changes such that order of the messages sent through the channel cannot be preserved, it will adversely affect the processing time, because a process must wait to receive message m1 before processing its already received message m2. Here, we have assumed that the proper order is m1 followed by m2. Our future work is directed at the new challenging area of designing recovery schemes for cluster federation computing environment in which different clusters may adopt different ways for checkpointing, for example, some may apply coordinated approach, where as other may apply asynchronous approach [18 ], [19 ]. 000 12 Informatica 31 (2007) 1-13 B. Gupta et al. Initiator nrncest I'.: Step I: i I" t lie li I t£t>r i I h:n is executed or failure recovcryonfailure = 1; else nccoicryonfailunc = 0; 11 asks every process Pj (j £ i) to send its V: corresponding lo its latest checkpoint Step 2 Step 3 Step 4 Step $ It rcccives alt V, foT 0 < j < n-1; It compute^ Vc"ve Vc1Vi ... It unicasts vc' to each P„ for j f i; It computes D, hy calculating (R,(r| - v,.'): if D,> 0 It searches the list R, till it find!! the largest integer m r) I hat satisfies R,(r) R ni J > D,. Then it sets its flag to 1 and considers V;corresponding to i|s checkpoint Cw is replaced hy Cjj, ) lor the next iteration: Step fi: EoSSSiPj! Step I: /* Checkpoints Ci,„ Cjrj. ■■■■ Q ««■/ are skipped V else 11 set; it* tlay io 0 and considers V,at C,., Tor the itext iteration; It receiies the and V, from eaelt process P,; it' nag " (I for ciieh process ■ 0 < j < n-1 /*Olobtili)'consistent checkpoints belonging to the maximum consistent stole cr? determined V iTrecovery on failure ■ 1 P, asks each process Pj Lo restart I tie application program from its Inst checkpoint corresponding to which Dj; P, implement; Rule I corresponding to iis restartingcheckpoint ai which Dj < 0; tt restans computation from ihe restarting checkpoint (iis GCC); else Pi asks each process Pj to continue iis normal computation from its laiesi checkpoint: P, implements Rule 2 followed hy Rule I: i* Periodic determination ofO'CCs is dene */ |i continues its normal computation from its latest checkpoint; else t'onirol (lows io Step 3: P, receives reciuesi from Pi: if P, has requested lo restart /* The system petturu ufter recovery from u failure*/ Step 2 Step i Step 4 P, implements Rule 1 corresponding to its re start inc clieckpoini si which Dj < ll: It resians compulation from the restarting checkpoint (its GCC I: else if P, has «guested lo continue with the application program P, implements Rule 2 followed by Rule 1: /* Periodic determination of GCC is done */ It continues its normal compulation from its latest checkpoint; else It sends v, corresponding to its latest checkpoint Cj^to the initiator process l\; ll receives yj from Pi, It computes D,by calculating (Rjir) - vV): if Dj>0 It searches the list R, till it finds the largest integer m (< ri that set is lies R (n - Rj(rn) >Dj. Then ii sends a flag of ] and Vjto Pi corresponding io its checkpoint Cj.n(l.e. Cj, is replaced by t^); else /* Ch&kpOvtto C it. .... are skipped */ It sends a 11. i j of D and VjBtCjj to the initiator process P,. Figure 6: The responsibilities of the initiator process ?! and each participating process Pj for the enhanced algorithm. A NOVELL ROLL-BACK MECHANISM FOR... 9 References [1] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems", IEEE trans. Software Engineering, vol. SE-13, no. 1, pp. 23-31, Jan 1987. [2] Y. M. Wang, A. Lowry, and W. K. Fuchs, "Consistent Global Checkpoints Based on Direct Dependency Tracking", Information Processing Letters, vol. 50, no. 4, pp. 223-230, May 1994. [3] K. M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems", ACM Trans. Computing Systems, vol.3, no. 1, pp. 63-75, Feb. 1985. [4] Y. Wang, "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints", IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, April 1997. [5] M. Singhal and N. G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994. [6] S. Venkatesan, T. Juang, and S. Alagar, "Optimistic Crash Recovery Without Changing Application Messages", IEEE Trans. Parallel and Distributed Systems, vol. 8, no.3, pp. 263-271, March 1997. [7] R. Baldoni, F. Quaglai, and P. Fornara, "An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems", IEEE Trans. Parallel and Distributed System, vol. 10, no.2, pp.181-192, Feb. 1999. [8] J. Tsai, S. -Y. Kuo, and Y. -M. Wang, "Theoretical Analysis for Communication-Induced Checkpointing Protocols with Rollback Dependency Trackability", IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 963971, Oct. 1998. [9] G. Cao and M. Singhal, " On Coordinated Checkpointing in Distributed Systems, IEEE Trans. Parallel and Distributed Systems ", vol. 9, no. 12, pp. 1213-1225, Dec.1998. [10] P. Jalote, Fault Tolerance in Distributed Systems, PTR Prentice Hall, (1994), Addison-Wesley, (1998). [11] B. Gupta, S.K. Baneijee and B. Liu, "Design of new roll-forward recovery approach for distributed systems", IEE Proc. Computers and Digital Techniques, Volume 149, Issue 3, pp. 105-112, May 2002. [12] D. Manivannan and M. Singhal, "Quasi-synchronous checkpointing: Models, characterization, and classification", Parallel and Distributed Systems, IEEE Transactions on Volume 10, Issue 7, pp. 703- 713, July 1999. [13] D. Manivannan, and M. Singhal, "Asynchronous recovery without using vector timestamps", Journal of Parallel and Distributed Computing, Volume 62, Issue 62 pp. 1695-1728, Dec 2002. [14] M. Ohara., M. Arai., S. Fukumoto., and K. Iwasaki.,"Finding a Recovery Line in Uncoordinated Checkpointing", Proceedings 24th Informatica 31 (2007) 1-13 13 International Conference on Distributed Computing Systems Workshops (ICDCSW'04), pp. 628 - 633, 2004. [15] R. H. B. Netzer and J. Xu, "Necessary and Sufficient Conditions for Consistent Global Snapshots", IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 2, pp. 165-169, Feb. 1995. [16] T. Juang and S. Venkatesan, "Crash Recovery with Little Overhead", Proc. 11th International Conference on Distributed Computing Systems, pp. 454-461, May 1991. [17] B. Gupta, Y. Yang, S. Rahimi, and A. Vemuri, "A High-Performance Recovery Algorithm for Distributed Systems", Proc. 21st International Conference on Computers and Their Applications, pp. 283-288, Seattle, March 2006. [18] S. Monnet, C. Morin, R. Badrinath, "Hybrid Checkpointing for Parallel Applications in cluster Federations", Proc. 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, Chicago, IL, USA, pp. 773-782, April 2004. [19] J. Cao, Y. Chen, K. Zhang and Y. He, "Checkpointing in Hybrid Distributred Systems", Proc.7th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN'04), pp. 136-141, May 2004.